ABSTRACT


Architectural  and  Circuit 
Issues  for  a  High  Clock 
Rate  Floating-Point 
Processor 


by 

Thomas  Richard  Huff 


Chair:  Professor  Richard  B.  Brown 


This  dissertation  examines  the  issues  confronting  the  designer  of  floating-point 
units  for  high-performance  microprocessors.  Sophisticated  hardware  coprocessors  for 
floating-point  arithmetic  have  been  pursued  primarily  within  the  past  decade.  The  develop¬ 
ment  of  these  coprocessors  parallels  that  of  integer  processors;  initially  simple  designs 
were  altered  to  satisfy  the  demand  for  increased  performance.  Architectural  optimizations 
and  technology  improvements  have  had  the  greatest  effect  on  performance.  This  work  will 
examine  these  issues  specifically  by  determining  the  mechanisms  through  which  a  floating¬ 
point  unit  can  stall  instruction  execution,  and  by  describing  the  implementation  and  verifi¬ 
cation  of  a  GaAs  floating-point  design.  This  dissertation  represents  a  unique,  comprehen¬ 
sive,  and  accessible  study  of  important  issues  for  supporting  high-performance  floating¬ 
point  execution. 

A  synchronization  problem  exists  between  the  integer  and  floating-point  units  that 
causes the FPU to stall the IPU. This can be overcome through the use of decoupling data
and instruction queues, a reorder buffer, and result busses. Increasing the number of queue
or  reorder  buffer  entries  results  in  improved  performance  that  cannot  be  equalled  either 
through  pipelining  the  FPU  functional  units,  or  by  attempts  to  reduce  floating-point  func¬ 
tional  unit  latency,  both  of  which  require  a  significant  increase  in  resources. 

One important class of stall conditions can be addressed by: analyzing memory system
characteristics; code scheduling to improve FPU performance on commonly encountered
instruction sequences; selection of the FPU instruction and data transfer point in the
integer  pipeline;  and  the  degree  of  instruction  issue.  Instruction  issue  policies  attempt  to  ex¬ 
ploit  available  parallelism  that  exists  in  the  instruction  stream.  Different  policies  offer  de¬ 
sign  points  which,  while  achieving  similar  performance,  vary  with  respect  to  design 
complexity  and  resource  requirements.  The  most  promising  designs  emphasize  either  the 
extraction  of  instruction-level  parallelism  through  greater  complexity,  or  focus  on  simplic¬ 
ity  to  increase  clock  frequency.  Verification  consumes  an  ever-increasing  share  of  design 
time  as  processors  become  more  complex.  Methods  of  functional  and  performance  valida¬ 
tion  of  the  FPU  are  discussed.  Several  utilities  were  created  to  support  implementation  of 
the  high-speed  VLSI  chips  used  in  the  project,  and  suggestions  for  an  automated  approach 
to  performing  timing  analysis  and  logic  optimization  are  presented. 

The  culmination  of  this  work  has  been  the  design  of  an  IEEE-754  compliant  double 
precision floating-point unit; the chip was designed in a 1.0µm GaAs direct-coupled FET
logic  process.  Most  of  the  conclusions  regarding  architectural  optimizations  are  indepen¬ 
dent  of  technology,  though  a  number  of  trade-offs  in  the  design  were  made  within  the  con¬ 
straints  of  integration  levels,  fanin,  fanout,  logic  topologies,  speed,  and  power  of  GaAs 
direct-coupled  FET  logic.  The  final  FPU  achieves  a  high  level  of  performance  that  exceeds 
many  current  leading  commercial  processors. 

CHAPTER 1
Introduction 

The  use  of  sophisticated  hardware  coprocessors  for  floating-point  computations  has 
occurred  primarily  within  the  past  8  to  10  years.  VLSI  chips  devoted  to  floating-point  arith¬ 
metic  appeared  in  the  early  1980’s  and  at  first  offered  only  single-chip  adders  and  multipli¬ 
ers.  These  tended  not  to  be  pipelined  and  often  required  external  control  units  and  register 
sets.  The  appearance  of  high  performance  workstations  has  resulted  in  an  increasing  num¬ 
ber  of  applications  that  utilize  floating-point  arithmetic.  Fields  such  as  computer  graphics, 
which  in  the  past  have  depended  on  integer  arithmetic,  are  moving  toward  specialized  float¬ 
ing-point  graphics  processors.  In  addition,  the  past  decade  has  seen  significant  growth  in 
digital  signal  processing,  and  a  similar  move  away  from  range  and  precision  constraints 
through  the  use  of  floating-point  numbers.  In  many  ways,  the  brief  history  of  floating-point 
processors  parallels  that  of  integer  processors.  Designs  were  initially  simple,  but  with  time, 
performance  gains  were  achieved  through  both  architectural  optimizations  and  technology 
improvements  which  led  to  increases  in  complexity.  Processor  performance  has  improved 
at  a  uniform  rate  of  50%  per  year  over  the  past  10  years  [Upton94];  Figure  1.1  shows  that 
floating-point  unit  clock  frequency  has  also  increased  approximately  50%  per  year  (refer 
to  Appendix  C  for  the  references).  Much  of  the  increase  in  clock  frequency  can  be  attribut¬ 
ed  to  better  process  technology.  As  Figure  1.2  shows,  there  is  also  a  corresponding  increase 
in  the  amount  of  circuitry  used  in  floating-point  units.  In  particular,  addition  and  multipli¬ 
cation  algorithms  have  benefitted  from  optimizations  which  have  reduced  the  latency  of 
both  to  as  few  as  2  cycles.  Addition  algorithms  have  been  optimized  through  leading  zero 
prediction  and  the  mutually  exclusive  characteristics  of  normalization,  alignment,  and 
rounding.  Multiplication  improvements  have  been  due  to  the  use  of  Wallace  arrays  and 
Booth  recoding. 

Figure 1.1 FPU Clock Frequency vs Year

Figure 1.2 FPU Transistor Count vs Year

Integer  processing  units  (PU’s)  and  floating-point  units  (FPU’s)  share  a  similar 
history  of  applying  more  complex  architectural  solutions  in  order  to  gain  performance.  In 
the  early  1980’s,  microprocessors  were  not  pipelined,  did  not  contain  on-chip  caches,  and 
often  required  many  cycles  to  complete  an  instruction.  In  recent  years  many  supercomputer 
features have been applied to the design of microprocessors, including pipelining, caches,
load/store  instruction  sets,  multiple  execution  units,  higher  degrees  of  instruction  issue,  and 
virtual  memory.  While  some  of  these  approaches  have  been  applied  to  floating-point  de¬ 
sign,  a  more  complete  investigation  of  the  design  space  is  warranted,  with  the  goal  being  a 
latency  tolerant  high-performance  GaAs  FPU.  This  dissertation  represents  a  unique,  com¬ 
prehensive,  and  accessible  study  of  important  issues  for  supporting  high-performance  float¬ 
ing-point  execution. 

The  development  of  a  processor  system  involves  four  steps:  evaluation  of  micro- 
architectural  choices  in  order  to  minimize  cycles-per-instruction  (CPI),  implementation  of 
a circuit that efficiently supports the architecture in light of technology constraints, functional
validation, and critical path analysis to optimize clock frequency. Architectural experiments
focus on identifying bottlenecks that limit performance, such as conditions that
generate  stalls.  A  number  of  metrics  can  be  used  in  the  architectural  studies.  The  progress 
of  individual  instruction  types  through  the  machine  can  be  tracked  by  the  average  latency 
that  an  instruction  takes  to  reach  the  various  points  of  interest:  the  floating-point  instruction 
staging  area,  the  issue  point  within  the  FPU,  the  completion  of  the  instruction  into  the  re¬ 
order  buffer,  and  the  write-back  of  the  register  file.  Additional  parameters  that  are  impor¬ 
tant  to  consider  include  dynamic  instruction  frequencies,  basic  block  size,  bus  utilization, 
average  degree  of  issue,  sizes  for  different  types  of  resources,  and  of  course  CPI.  These 
measurable  quantities  will  be  used  to  discover  ways  of  improving  performance. 

Processor  performance  has  been  progressing  at  a  historical  rate  of  about  one  percent 
per  week,  and  the  design  and  verification  time  for  any  additional  feature  should  be  justified 
against  this  standard.  An  improvement  of  less  than  10%  is  of  doubtful  benefit,  unless  it  is 
very  simple  to  implement,  especially  in  consideration  of  the  accuracy  limitations  of  most 
analysis  techniques.  For  example,  while  trace-driven  simulation  provides  access  to  billions 
of  instructions,  the  design  space  is  so  broad  and  the  possible  experiments  are  so  varied  that 
simulating this number of instructions may not be reasonable for each run. Inevitably, questions
arise about how to improve the simulation speed and how to ensure the veracity of the
results. These issues will be briefly examined in the context of instruction sampling and error
sources.
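
The justification standard described above can be made concrete with a little compounding arithmetic. The following is a minimal sketch, assuming the one-percent-per-week rate compounds weekly; the function name and the 10% example are illustrative and not taken from the dissertation's tools.

```python
import math

def weeks_of_progress(speedup, weekly_gain=0.01):
    """Weeks of baseline progress (compounded weekly) needed to match
    a given speedup, e.g. speedup=1.10 for a 10% improvement."""
    return math.log(speedup) / math.log(1.0 + weekly_gain)

if __name__ == "__main__":
    # A feature worth 10% must take fewer than this many weeks to
    # design and verify, or the baseline trend overtakes it.
    print(f"10% improvement ~ {weeks_of_progress(1.10):.1f} weeks of progress")
    # Yearly growth implied by 1% per week, compounded.
    print(f"1% per week compounds to {(1.01 ** 52 - 1) * 100:.0f}% per year")
```

Under this assumption a 10% gain is worth roughly ten weeks of schedule, which is one way to read the threshold quoted in the text.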

The architectural experiments presented in this dissertation begin with the analysis
of  three  issue  policies  for  floating-point  instructions.  These  policies  specify  whether  in¬ 
structions  issue  and  complete  in  order  or  out  of  order;  each  requires  a  different  degree  of 
resources  and  design  complexity.  Integration  levels  are  an  order  of  magnitude  lower  for 
GaAs  than  for  CMOS;  this  drives  many  decisions  concerning  resource  allocation.  The  next 
architectural  experiment  sought  to  answer  the  question:  how  sensitive  is  overall  perfor¬ 
mance  to  the  latency  of  floating-point  functional  units?  Approaches  for  reducing  latency 
tend  to  require  an  increase  in  resources,  since  more  conditions  need  to  be  resolved  earlier 
in time through the use of parallel logic. Pipelining the functional units also affects
chip area. A designer needs to know how the cost/benefit ratio compares to that of adding
other  features,  such  as  additional  reorder  buffer  entries.  Design  time  must  also  be  consid¬ 
ered,  since  lower-latency  functional  units  are  often  more  complex  and  require  more  valida¬ 
tion.  Other  questions  related  to  the  characteristics  of  the  functional  units  include:  what 
applications  use  square  root  and  does  it  make  sense  to  support  this  operation  in  hardware; 
are  addition  operations  common  enough  to  warrant  a  second  add  unit;  can  division  be  per¬ 
formed  within  the  multiplication  unit  without  degrading  the  performance  of  multiply  in¬ 
structions;  what  precision  operand  formats  should  be  supported  since  these  also  will  have 
an  impact  on  area  and  critical  paths? 

Since  integration  levels  are  low  for  GaAs,  the  PU  and  FPU  need  to  be  partitioned 
into  separate  chips,  which  increases  the  latency  of  floating-point  operations,  due  to  chip 
crossings.  Queues  can  be  used  to  decouple  these  units  and  allow  more  instruction  slip,  but 
this  approach  has  an  impact  on  the  support  of  precise  exceptions.  The  dissertation  discusses 
different  characteristics  of  memory  and  floating-point  computation  exceptions,  and  shows 
how  performance  will  be  affected  by  each  type.  There  may  also  be  an  intrinsic  difference 
in  clock  frequencies  between  the  PU  and  FPU,  which  queues  can  hide. 
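
The decoupling effect of a queue can be illustrated with a toy cycle-level model. This is only a sketch under simplifying assumptions that are not the dissertation's simulator: the integer unit issues one instruction per cycle, the FPU is not pipelined, and every floating-point instruction has the same latency. The names `count_ipu_stalls`, `queue_depth`, and `fpu_latency` are hypothetical.

```python
from collections import deque

def count_ipu_stalls(stream, queue_depth, fpu_latency):
    """Count cycles in which the integer unit stalls because the
    floating-point instruction queue is full.

    stream is a string of 'I' (integer) and 'F' (floating-point) ops.
    """
    queue = deque()
    busy = 0      # cycles left on the instruction the FPU is executing
    stalls = 0
    i = 0
    while i < len(stream):
        # FPU side: finish work, then start the next queued instruction.
        if busy > 0:
            busy -= 1
        if busy == 0 and queue:
            queue.popleft()
            busy = fpu_latency
        # Integer side: forward an FP instruction or stall if no room.
        if stream[i] == "F":
            if len(queue) < queue_depth:
                queue.append(i)
                i += 1
            else:
                stalls += 1
        else:
            i += 1
    return stalls

if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, count_ipu_stalls("FFFF", depth, fpu_latency=3))
```

In this model a burst of floating-point instructions stalls the integer unit when the queue is shallow, and a deeper queue absorbs the burst, which is the instruction slip the text refers to.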

Memory  access  time  has  improved  at  a  much  slower  rate  than  processor  speed;  sup¬ 
plying  instructions  and  data  for  a  machine  with  a  4ns  clock  period  becomes  difficult.  In 
GaAs, integration levels severely constrain the amount of cache that can reside on-chip, so
other techniques are needed to offset the corresponding loss in performance, such as
prefetching and higher bandwidth for memory accesses.

In  currently-available  packaging  technology,  a  multiple  chip  design  places  de¬ 
mands  on  the  limited  number  of  package  pins.  This  argues  for  sharing  I/O  pins  between  the 
FPU  and  data  cache.  The  impact  of  this  approach  is  evaluated.  These  interfacing  issues  also 
affect  the  point  in  the  integer  pipeline  at  which  floating-point  instructions  are  transferred  to 
the  FPU.  Depending  on  the  degree  of  issue  and  the  latency  of  the  first-level  data  cache,  a 
later  transfer  point  might  add  unnecessary  latency  to  every  floating-point  instruction.  Bal¬ 
ance  between  processing  and  communication  resources  is  also  important  as  instruction  par¬ 
allelism  increases  within  the  FPU. 

In  the  multidimensional  design  space  briefly  described  above,  different  design 
points  may  achieve  similar  performance  in  very  different  ways.  For  instance,  a  complex  de¬ 
sign  might  implement  dual  issue  and  an  out-of-order  completion  policy  using  a  reorder 
buffer,  in  effect  trading  clock  frequency  for  a  decrease  in  CPI.  On  the  other  hand,  a  simpler 
design  might  forgo  the  use  of  a  reorder  buffer,  choosing  instead  an  in-order  issue  and  com¬ 
pletion  policy  and  a  more  conservative  mechanism  for  supporting  precise  memory  excep¬ 
tions.  If  the  performance  of  these  two  implementations  is  similar,  the  simple  design  would 
be  favored  for  the  benefit  of  its  shorter  design  cycle  and  an  ability  to  more  easily  track  tech¬ 
nology  improvements. 
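
The trade between clock frequency and CPI can be checked with the usual relation that average time per instruction equals CPI divided by clock frequency. The sketch below uses hypothetical numbers chosen only to show two design points with equal performance; they are not measurements from this work.

```python
def time_per_instruction_ns(cpi, clock_mhz):
    """Average time per instruction in nanoseconds: CPI * clock period."""
    return cpi * 1000.0 / clock_mhz

if __name__ == "__main__":
    # Hypothetical design points (assumed values, for illustration only).
    complex_design = time_per_instruction_ns(cpi=1.2, clock_mhz=200.0)
    simple_design = time_per_instruction_ns(cpi=1.5, clock_mhz=250.0)
    print(f"complex: {complex_design:.2f} ns, simple: {simple_design:.2f} ns")
```

With these assumed values the simpler design recovers through clock rate what it gives up in CPI, which is the situation in which the text favors the simpler design.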

Once  a  micro-architecture  has  been  defined,  a  specific  implementation  is  chosen 
that  is  suitable  to  the  features  and  limitations  of  the  target  technology.  For  example,  several 
types  of  adders  are  used  in  various  parts  of  the  FPU  design;  issues  such  as  fanin,  fanout, 
logic  topologies,  area,  speed,  and  power  determine  which  designs  are  best  suited  for  a  GaAs 
processor.  Most  of  the  functional  unit  designs  used  in  the  FPU  originated  with  work  done 
elsewhere.  These  schemes,  all  of  which  strive  to  reduce  latency,  are  evaluated  in  this  dis¬ 
sertation  in  the  context  of  GaAs  technology.  Several  issues  which  have  been  overlooked  in 
those  designs  will  be  briefly  discussed  and  a  novel  approach  for  the  design  of  a  conversion 
unit will be presented. At a higher level, the dissertation will examine predecoding to reduce
the gate-depth of critical issue logic and ways of balancing observability during testing
versus design time, resource requirements, and impact on critical paths. Methods of
supporting floating-point loads, stores, and precise memory exceptions will also be discussed.
Finally, functional verification of a complex design will be examined primarily in
the  context  of  random  testing,  both  at  the  functional  unit  level  and  at  the  higher  instruction- 
stream  level. 

The  last  component  of  a  processor  design  involves  the  analysis  tools  that  provide 
feedback  about  timing  and  functionality.  Often,  custom  utilities  need  to  be  written  in  order 
to  address  a  specific  need,  and  these  point  tools  must  be  developed  quickly  since  their  ab¬ 
sence  can  delay  a  particular  phase  of  the  design.  The  programming  environment  chosen  for 
the  tool  will  depend  on  the  nature  of  the  problem,  including  both  how  large  a  design  the 
utility  will  be  used  for  and  how  often  this  particular  analysis  will  be  performed.  Chapter  5 
will  discuss  a  number  of  tools  that  have  been  created  in  order  to  enable  different  analysis 
capabilities,  beginning  with  a  delay  calculator  that  supports  accurate  determination  of 
GaAs circuit delays. This utility, which provides input to static timing analysis, is essentially
a recursive network traversal engine, which in turn is the core of several other tools. For
example, a set of tools based on this engine has been developed to support designs such as
Aurora III, which uses a two-phase clocking scheme. In spite of a designer’s best intentions,
identifying all hazards is difficult without an automated approach; missing even one
such  error  can  seriously  reduce  the  effective  clock  frequency.  Control  of  clock  skew  is  also 
an  important  issue  for  high-frequency  designs,  and  a  second  set  of  utilities  built  on  the  net¬ 
work  traversal  engine  has  been  created  to  extract  the  clock  distribution  network  and  gener¬ 
ate  a  ready-to-run  HSPICE  netlist.  Information  derived  from  simulating  this  netlist  is  used 
in  several  ways,  including  identification  of  latch-to-latch  skew,  clock  transit  time  to  any 
point  on  a  chip,  and  wire  sizing  along  the  distribution  network  to  control  interconnect  re¬ 
sistance.  Timing  analysis  also  focuses  on  reducing  gate-depth  along  critical  paths,  and  a 
third  utility  supports  this  type  of  analysis.  A  latch-based  design  allows  time  to  be  borrowed 
from  the  phase  that  precedes  or  follows  a  critical  path.  Consequently,  the  target  logic  depth 
of  20  gates  per  phase  can  be  relaxed  in  certain  instances  if  the  worst-case  previous  path  is 
shorter  than  this  threshold.  The  levelizer  utility  generates  2D  and  3D  histograms  that  enable 
the designer to quickly identify sections of logic that require improvement.
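
The time-borrowing rule and the depth histogram described above can be sketched as follows. This is a simplified illustration, not the levelizer utility itself: it assumes borrowing only from the preceding phase, measures slack in gate levels rather than in time, and uses hypothetical function names.

```python
from collections import Counter

TARGET_DEPTH = 20  # gates per phase, the target quoted in the text

def depth_ok(depth, prev_worst_depth, target=TARGET_DEPTH):
    """Accept a path that fits its own phase, or that overruns by no more
    than the slack left by the worst-case path of the preceding phase."""
    if depth <= target:
        return True
    slack = max(0, target - prev_worst_depth)
    return depth - target <= slack

def depth_histogram(depths, bin_width=5):
    """Count paths per logic-depth bin, keyed by the lower bin edge."""
    return Counter((d // bin_width) * bin_width for d in depths)

if __name__ == "__main__":
    print(depth_ok(23, prev_worst_depth=15))   # overrun of 3, slack of 5
    print(depth_ok(23, prev_worst_depth=19))   # overrun of 3, slack of 1
    print(sorted(depth_histogram([3, 7, 8, 21]).items()))
```

A real tool would work from extracted delays rather than gate counts, but the same comparison of overrun against neighboring slack applies.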

Finally, an automated high-level timing methodology is described as a motivation
for future work. During the design of the FPU, it became apparent that a large class of timing
optimizations currently must be performed by the user. These transformations tend to
be  quite  mechanical  and  include  factoring  out  late-arriving  signals,  pattern  matching  logic 
optimization,  manual  retiming,  and  buffer  selection  and  sizing.  Altogether,  these  different 
actions  could  be  collected  into  an  extremely  effective  automated  timing  analysis  and  opti¬ 
mization  system. 

The  observations  of  this  dissertation  have  culminated  in  the  implementation  of  a 
GaAs  FPU  which  achieves  a  high  level  of  performance  comparable  to  current  leading  com¬ 
mercial  processors.  Based  on  the  extensive  simulations  of  this  study,  this  FPU  delivers  this 
performance  while  only  infrequently  stalling  the  integer  processing  unit.  The  design  is 
IEEE  compliant  with  respect  to  rounding  modes  and  exceptions,  supports  40  floating-point 
instructions,  consists  of  500,000  transistors,  operates  at  a  clock  frequency  of  250MHz,  and 
achieves  a  SPECfp92  rating  of  greater  than  300. 


CHAPTER  2 

Circuit  Issues  for  GaAs 


2.1  Description  of  the  Technology 

Direct-coupled  FET  Logic  (DCFL)  gates  are  similar  in  topology  to  NMOS,  with  in¬ 
verters  and  NOR  gates  comprising  the  basic  building  blocks.  Enhancement  pulldown  and 
depletion  pullup  devices  are  ratioed  in  such  a  way  as  to  provide  desired  output  high  and  low 
voltages  over  normal  operating  conditions.  The  depletion  device  is  source-gate  connected 
to  provide  a  current  source.  Gate  delay  for  an  unloaded  device  is  on  the  order  of  60ps  and 
loaded gates typically have delays in the range of 100ps to 150ps. The gate of a MESFET
is actually a Schottky diode; there is no gate dielectric as in MOSFET devices. This diode
gate  introduces  several  unique  issues  into  the  design  of  VLSI  circuits,  one  example  being 
the  small  voltage  swing  for  these  devices.  Since  the  gate  is  a  diode,  the  gate  voltage  for  a 
logic  high  is  clamped  to  a  single  diode  drop,  on  the  order  of  0.6  volts.  For  logic  gates  to 
function  properly  with  such  low  output-high  voltages,  the  enhancement  transistors  must 
have  small  threshold  voltages,  typically  about  0.2  volts.  Consequently,  designs  are  sensi¬ 
tive  to  voltage  drops  along  the  ground  rail.  A  top-level  aluminum  interconnect  plane  is  used 
to  provide  a  clean  ground  with  less  than  20mV  of  noise;  each  cell  connects  directly  to  this 
plane.  IR  drops  along  Vdd  are  not  as  critical  and  gates  operate  correctly  with  little  loss  in 
speed  for  a  power  supply  voltage  as  low  as  1.2  volts.  Power  is  routed  in  Metal  3  and  is  sized 
such  that  no  gate  sees  an  IR  drop  along  Vdd  of  more  than  0.5  volts. 

The  diode  gate  of  a  MESFET  also  results  in  an  unusual  transfer  characteristic,  as 
shown  for  the  NOR  gate  of  Figure  2.1.  As  the  input  voltage  increases  beyond  0.8  volts,  the 
output  voltage  begins  to  rise;  above  a  certain  input  voltage  the  output  erroneously  becomes 
a logic one. When this phenomenon occurs, the diode gate current is large and the gate-drain
junction becomes forward biased.

Figure 2.1 Two-input NOR and Transfer Function (axis label: Vin)

Though DCFL signals are normally clamped by the
driven gates, when large buffers are used to quickly change the state of highly capacitive
interconnect, the overdriving condition may occur. The buffer used to drive such a net may
provide  a  current  which  is  appropriate  for  charging  the  wire  but  is  too  large  for  destination 
gates.  A  solution  used  for  the  Aurora  I  design  (the  first  of  three  GaAs  microprocessors  de¬ 
signed  in  our  research  group)  involved  placing  a  diode  at  the  output  of  each  buffer  cell  in 
order  to  clamp  the  output  high  signal  to  one  threshold  voltage.  A  more  effective  feedback 
approach  for  buffering  was  subsequently  obtained  from  Vitesse  and  has  been  used  for  later 
designs.  Shown  in  Figure  2.2  [Fulkerson91],  this  buffer  provides  a  large  transient  current 
to  charge  a  wire,  then  reduces  its  drive,  providing  a  smaller  current  to  maintain  a  stable  log¬ 
ic  high  voltage.  The  small  feedback  transistor  (typically  5  microns  wide)  serves  the  same 
purpose  as  the  diode  used  in  the  earlier  design,  but  has  the  advantage  of  more  quickly  dis¬ 
charging  the  internal  node  of  the  buffer.  To  improve  the  noise  margins  and  yield  of  this  con¬ 
struct,  a  small  diode  is  often  added  at  the  drain  of  the  feedback  transistor.  This  diode  serves 
to  boost  the  voltage  level  of  the  internal  node,  and  consequently  the  level  of  a  logic  high. 
The  feedback  transistor  is  on  only  while  driving  a  logic  high.  For  a  logic  low,  the  pullup 
transistor of the output stage is off and the pulldown transistor acts as a current sink. This
behavior adds a frequency-dependent component to overall power dissipation, whereas the
power dissipation of conventional DCFL gates is almost independent of the operating
speed. These buffers are among the most commonly used cells in a design, and can contribute
more than 40% of the power of a chip; to be accurate, estimation of power dissipation
must reflect this dynamic power component.

Figure 2.2 Feedback Buffer

Another  characteristic  of  the  GaAs  technology  being  discussed  is  a  high  transistor 
source  resistance,  which  tends  to  limit  the  use  of  stacked  transistor  logic,  such  as  NAND 
gates.  A  2-input  NAND  represents  the  highest  degree  of  stacking  that  is  allowable,  but  does 
not  offer  a  speed  advantage  compared  to  an  inverter-NOR  version  of  the  same  function. 
The  use  of  only  DCFL  NOR  gates  can  increase  the  number  of  levels  along  critical  paths 
unless special circuit structures are used; critical path lengths for the Aurora II design had
15% more gates than a comparable CMOS implementation. One such structure used extensively
for later designs is an Earle latch that combines a 2-input mux with a latch and a high-current
output buffer. The latch/output-buffer operates in a feedback mode similar to that
just discussed. This circuit accounted for 40% of the circuit area of the Aurora II design.

Leakage  currents,  which  are  several  orders  of  magnitude  larger  in  GaAs  than  sili¬ 
con  devices,  constrain  the  maximum  fanin  of  a  DCFL  gate  to  four  inputs.  More  inputs  re¬ 
duce  the  logic  high  level,  especially  at  higher  temperatures.  However,  there  are  times  when 
a  larger  fanin  gate  is  needed  to  reduce  delay  along  a  critical  path.  A  solution  can  be  found 
by  extending  the  feedback  buffer  to  support  the  additional  inputs. 

Because  this  gate  type  ensures  that  output  pullup  and  pulldown  transistors  are  not 
on  simultaneously,  one  can  size  these  transistors  solely  for  their  driving  capability  and  not 
in  consideration  of  noise  margins.  As  a  result,  the  pullup  device  will  be  large  enough  to  ac¬ 
commodate  the  leakage  currents  of  a  reasonable  number  of  pulldown  devices,  while  still 
providing a sound logic one. Figure 2.3 shows a 4-input version of this type of gate. These
buffered gates need to be used judiciously since they are more costly in terms of area than
their DCFL counterparts, as shown in Table 2.1. The 5- to 8-input DCFL areas listed in this
table are estimates, since these gates are not functionally reliable across process and temperature
corners and thus have not been implemented. The buffered versions of these gates
use large transistors for the output stage; the difference in area could be reduced by using
smaller transistors, but at the expense of capacitive driving capability.

Figure 2.3 4-Input NOR Squeeze Gate


The  diode  gate  also  makes  pass  gates  and  dynamic  logic  more  difficult  or  altogether 
impractical.  The  control  and  data  signals  to  pass  gates  need  to  be  driven  by  buffered  cells 
to  operate  correctly  at  high  temperatures.  The  register  file  design  in  the  first  two  Aurora 
chips  made  use  of  a  pass  gate  latch  in  order  to  reduce  area,  and  was  simulated  thoroughly 
in  order  to  ensure  functionality.  Dynamic  logic  tends  to  be  difficult  to  implement  in  GaAs 
due  to  the  current  which  flows  into  the  gate  terminal  of  MESFET  transistors.  This  current 
is  typically  small,  on  the  order  of  10  to  30  uA,  but  can  still  be  disruptive  for  dynamic  pre¬ 
charging  of  nodes.  Accurate  analysis  tools  are  a  necessity  when  utilizing  dynamic  logic  and 
a failure in this area can result in a design which is not functional at any frequency. Current
into the gate of transistors also places a constraint on how much fanout is tolerable, since
the loads connected to a driving gate may exceed the current sourcing capability of that
gate. The distribution of reset is a case where performance may not be an issue and fanout
may be quite large. Analysis tools need to identify situations like this in order to ensure
proper functionality.

Table 2.1 Area Comparison of DCFL and Buffered NOR Gates

Fanin   DCFL Area (µm²)   Buffered Area (µm²)   % Difference
1       --                1364                  144
2       823               1730                  110
3       --                --                    --
4       1401              --                    63
5       --                2591                  50
6       2095*             2876                  37
7       2503*             3162                  26
8       2961*             3434                  --

* estimated; -- value missing from the source text

2.2  Importance  of  Interconnect 

The importance of interconnect in a VLSI process cannot be overstated. The switching
delay, t, for any logic family, is related to the difference in charge between states at the
output of a logic gate, ΔQ, and to the current available to effect a change of state, I: t ≈ ΔQ/I.
Sensitivity to parasitic loading varies with process and logic family. In FET technologies, this
is  the  dominant  delay  mechanism;  it  calls  for  small  logic  swings,  high  transconductance, 
and  low  parasitic  capacitance. 
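
The relation between switching delay, charge difference, and available current can be turned into a quick estimate. The sketch below assumes the charge difference is the load capacitance times the logic swing, so that delay is roughly C times the swing divided by the current. The 0.6 V swing echoes the diode-clamped logic high mentioned in Section 2.1; the capacitance and current values are hypothetical.

```python
def switching_delay_ps(load_fF, swing_V, current_uA):
    """Estimate switching delay as (C * dV) / I.

    fF * V / uA gives nanoseconds, so multiply by 1000 for picoseconds.
    """
    return load_fF * swing_V / current_uA * 1000.0

if __name__ == "__main__":
    # Assumed example: 50 fF of load, 0.6 V swing, 300 uA of drive.
    print(f"{switching_delay_ps(50.0, 0.6, 300.0):.0f} ps")
    # Halving the parasitic load halves the estimate.
    print(f"{switching_delay_ps(25.0, 0.6, 300.0):.0f} ps")
```

The estimate scales linearly with load and swing and inversely with drive current, which is why the text calls for small logic swings, high transconductance, and low parasitic capacitance.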

Most  of  the  parasitic  capacitance  comes  from  interconnect.  Of  primary  importance 
is  keeping  the  circuit  area  as  small  as  possible  to  minimize  wire  length;  this  reduces  both 
parasitic  capacitance  and  time-of-flight  for  signals.  Routing  capacitance  is  minimized  by 
using  sufficient  levels  of  interconnect,  narrower  lines,  larger  separation  between  intercon¬ 
nect  layers,  and  lower  dielectric-constant  insulators.  The  effect  of  reducing  the  space  be¬ 
tween  lines,  as  is  done  when  processes  are  scaled,  is  not  immediately  obvious;  while  it 
reduces  the  circuit  area,  it  does  increase  horizontal  line-to-line  capacitance.  To  evaluate  the 
significance  of  interconnect  on  overall  area  utilization,  we  looked  at  an  8x8  multiplier  im¬ 
plemented  in  several  GaAs  process  technologies.  The  total-routing-area  data  shown  in 
Table  2.2  makes  a  strong  case  for  reducing  interconnect  spacing  to  the  fabrication  limit 
[Brown92b].  Table  2.2  also  demonstrates  how  smaller  transistor  dimensions  have  a  smaller 
effect  on  overall  layout  utilization. 

Table 2.2 Comparison of 8x8 Multipliers in Three DCFL Processes

Columns: Gate Metal, Metal 1, Metal 2, Metal 3, Total Layout Area, Total Routing Area

Process A   1.00   1.00   1.00   1.00   1.00
Process B   0.60   0.50   0.28   0.49   0.21
Process C   0.50   0.97   1.11   1.43   0.97   0.82

The importance of minimizing interconnect capacitance is illustrated in Figure 2.4
and Figure 2.5, which show the effects of reducing capacitive load and of reducing unloaded
gate delay on four critical paths in the Aurora II microprocessor. The logic paths in these
plots are from the register file (RF), adder (A1 and A2), and branch logic (BR). (These figures
ignore the fact that faster gates would have greater transconductance and therefore
drive  the  capacitive  loads  more  effectively).  The  plots  show  clearly  that  reducing  intercon¬ 
nect  capacitance  would  be  even  more  effective  at  increasing  circuit  speed  than  would  re¬ 
ducing  intrinsic  gate  delay.  The  effects  on  path  delay  vary  among  the  paths  simulated.  The 
smallest  difference  in  results  is  for  the  branch  logic,  where  a  50%  reduction  in  capacitance 
has  a  40%  greater  effect  than  a  similar  reduction  in  unloaded  gate  delay.  The  biggest  dif¬ 
ference  is  in  the  register  file,  where  capacitance  has  a  248%  greater  effect.  The  branch  path 
consists  of  a  large  number  of  lightly  loaded  paths,  whereas  the  RF  path  involves  a  smaller 
number  of  heavily  loaded  gates. 
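
The contrast between the branch and register-file paths can be illustrated with a first-order delay model. The sketch below is not taken from the Aurora II simulations: it assumes each gate contributes an unloaded delay plus a delay proportional to its capacitive load, and every gate count and delay value is hypothetical, chosen only so that the ratios resemble those quoted above.

```python
# First-order path-delay model (illustrative only; all values are hypothetical).

def path_delay(n_gates, unloaded_ps, load_ps, cap_scale=1.0, gate_scale=1.0):
    """Delay in ps of a chain of identical gates."""
    return n_gates * (unloaded_ps * gate_scale + load_ps * cap_scale)


def savings(n_gates, unloaded_ps, load_ps, reduction=0.5):
    """Return (ps saved by cutting capacitance, ps saved by cutting gate delay)."""
    base = path_delay(n_gates, unloaded_ps, load_ps)
    keep = 1.0 - reduction
    cap_saved = base - path_delay(n_gates, unloaded_ps, load_ps, cap_scale=keep)
    gate_saved = base - path_delay(n_gates, unloaded_ps, load_ps, gate_scale=keep)
    return cap_saved, gate_saved


# Hypothetical branch-like path: many lightly loaded gates.
br_cap, br_gate = savings(n_gates=20, unloaded_ps=50, load_ps=70)
# Hypothetical register-file-like path: fewer, heavily loaded gates.
rf_cap, rf_gate = savings(n_gates=8, unloaded_ps=50, load_ps=174)

print(f"branch-like: capacitance {br_cap:.0f} ps, gate delay {br_gate:.0f} ps")
print(f"RF-like:     capacitance {rf_cap:.0f} ps, gate delay {rf_gate:.0f} ps")
```

In this simplified model the ratio of the two savings equals the ratio of load-dependent delay to unloaded delay per gate, which is why the heavily loaded path gains far more from a capacitance reduction.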

Figure 2.4 Interconnect Capacitance Reduction (%) [plot of path delay (ns) versus
interconnect capacitance reduction (%)]

Figure 2.5 Unloaded Gate Delay Reduction (%) [plot versus gate delay reduction (%)]

The importance of having enough layers of interconnect merits further illustration.
In our designs, Gate Metal and Metal 1 are used for wiring inside of leaf cells, and Metals
1, 2, and 3 are used for datapath, standard cell, and global routing. Metal 4 is a ground
plane, and Vdd is distributed on Metal 3. Table 2.3 shows the improvement in density
which resulted from moving from HGaAs II (a 3-metal process) to HGaAs III (a 4-metal
process). Of course, geometric design rule changes between the processes and other factors
also affect the density, but cells tend to be interconnect-limited instead of device-limited.
The control blocks are different circuits (bypass logic in HGaAs II and stall logic in HGaAs
III), but they are about the same size, and both are implemented in standard cells using the
same logic synthesis tool (Finesse, from Cascade Design Automation). The register files in
Table 2.3 are both 32-word x 32-bit, three-port, tree-decoded, pass-gate latch
implementations, which differ only in buffering.


Table 2.3 Density comparison between 3-metal and 4-metal processes.

                        HGaAs II                      HGaAs III
Circuit                 Transistor   Density          Transistor   Density
                        Count        (Trans./mm^2)    Count        (Trans./mm^2)
Largest Control Block   582          1067             516          1364
Register File           21,910       2014             23,278       4253
CPU                     60,500       540              160,000      1475

The density numbers for both CPUs include all of the unoccupied space in the pad
frame; there is actually more unused space in the version with 4-metal interconnect. Some
of the increase in density is due to the inclusion of additional memory structures for the
small on-chip instruction cache on the 4-metal chip. But even when this difference is
factored out, the HGaAs III version of the CPU is still about 2.4 times denser. In this
analysis, half of the improvement is due to the third layer of routing; improved circuit
structures and layout techniques incorporated into newer CAD tools account for another 35%,
and the remaining 15% of improvement results from smaller line widths in the HGaAs III
process.


Adding interconnect layers to a digital process beyond a routable gate metal, 3
interconnect levels, and a ground plane would result in diminishing returns. Trying to achieve
high performance in a DCFL process with fewer than five layers or with a coarse interconnect
pitch or an inefficient design style, though, starts a vicious cycle. A larger layout has
more capacitance, therefore requiring larger buffers, which further increase the layout size,
parasitic capacitance and power dissipation, further requiring larger buffers.

2.3 Importance of Technology Support for On-Chip Memory

The  data  and  conclusions  for  this  section  are  derived  from  work  done  by  Ajay 
Chandna  [Chandna94,  Brown92b]. 

On-chip memory that is fast and efficient in area and power is essential to achieving
performance for modern processor designs. The latency in going off-chip for the first level
data cache in the Aurora III architecture is very costly to overall performance, as will be
discussed later. Appropriate partitioning, the use of decoupling queues, and a fast system
substrate all contribute to offsetting the lower integration levels of GaAs. Ultimately,
however, it is not possible to avoid the need for dense, fast memory that is closely coupled
to the computation units of a design. This section will discuss GaAs technology in terms of
the characteristics that are necessary to adequately support memory on-chip.

Subthreshold currents in MESFETs are several orders of magnitude larger than
those in MOSFETs. In static RAM structures, area and power are strongly related to leakage
current. Though much less attention has been focused on minimizing leakage currents
than on increasing transconductance, leakage currents are as important to performance. If
too many memory cells are connected to a bit line, the leakage current through the pass
transistors connected to unselected memory cells (about 100 nA/bit) could corrupt the data
of a selected memory cell (about 20 µA). The total leakage on a bit-line should be an order
of magnitude smaller than the active current. Consequently, for GaAs, the number of bits
that can be safely connected to a column is limited to 32. This constraint requires that a
significant portion of the total RAM area be devoted to sense amplifiers and write circuitry.
Table 2.4 shows how SRAM area would decrease if leakage currents could be reduced to
allow more memory cells per column, thereby amortizing the column support circuitry over
more bits [Oettel92]. As can be seen, for this design at 32 bits/column only 70.6% of the
total chip area is consumed by the memory cells. A reduction in leakage current by only
one order of magnitude would increase the percentage of area occupied by the memory
cells to 92% of the total area.
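
The two relationships in this paragraph can be checked with a little arithmetic. The sketch below is an illustration rather than part of the original analysis: it treats the approximate currents quoted above as exact, and it assumes that the cell array itself keeps a fixed area while only the column support circuitry shrinks, which is an assumption used to relate the two rows of Table 2.4.

```python
# Approximate figures quoted in the text, treated as exact for illustration.
LEAK_PER_CELL_A = 100e-9    # leakage through one unselected pass transistor
ACTIVE_CURRENT_A = 20e-6    # current drawn by the selected cell


def leakage_ratio(bits_per_column):
    """Bit-line leakage from unselected cells relative to the active current."""
    return (bits_per_column - 1) * LEAK_PER_CELL_A / ACTIVE_CURRENT_A


# Normalized SRAM area versus bits per column, from Table 2.4.
NORMALIZED_AREA = {32: 1.00, 64: 0.87, 128: 0.80, 256: 0.77, 512: 0.75}
CELL_FRACTION_AT_32 = 0.706  # cells occupy 70.6% of the area at 32 bits/column


def cell_area_percent(bits_per_column):
    """Cell share of total area, assuming the cell array area stays constant."""
    return 100.0 * CELL_FRACTION_AT_32 / NORMALIZED_AREA[bits_per_column]


for bits in sorted(NORMALIZED_AREA):
    print(bits, round(leakage_ratio(bits), 3), round(cell_area_percent(bits), 1))
```

Under the constant-cell-area assumption, the computed percentages land within about half a percentage point of the values listed in Table 2.4, and the leakage ratio grows linearly with column height when the per-cell leakage is left unchanged.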

Table 2.4 Effect of Reducing Leakage Currents on Area of 1Kx8 SRAM

Number of Bits / Column              32     64     128    256    512
Normalized SRAM Area                 1.00   0.87   0.80   0.77   0.75
Cell Area Percentage of Total Area   70.6   81.6   88.4   92.1   93.8

In any technology, the pullup of a static RAM cell should provide just enough current
to offset the leakage current of the pulldown devices. Leakage currents, therefore, also
set the lower limit for cell power. In conventional GaAs DCFL processes, long, minimum-width
depletion transistors are used to keep this current small. The characteristics of these
devices present an area/power trade-off; for example, in the SRAM used for the Aurora III
design, the highest impedance standard-threshold depletion transistor that fits in a 400 µm²
cell provides much more current than is needed to offset the leakage currents. As the area
of the cell is decreased, the pullup length must be decreased, increasing the power. Figure
2.6 shows the effect of varying the pullup length (cell size) on power dissipation. This plot
includes curves for a digital process pullup transistor, a special higher-threshold depletion
transistor, and a polysilicon load. The polysilicon load curve was constructed assuming
lightly-doped resistors, which can be located above the remaining 4 transistors, adding no
additional area. As seen in Figure 2.6, poly loads are invaluable to SRAM designs.

Figure 2.6 SRAM Cell Power vs. Cell Size


2.4  Summary 

GaAs was chosen as the implementation technology for several reasons. First, its
fast gate switching speeds and low power supply voltage seem to offer a desirable
power-delay product for high-performance VLSI designs. We have designed three processors in
GaAs, ranging from a very simple machine to a full-functioned superscalar design. These
designs provided an opportunity to evaluate GaAs DCFL in realistic VLSI designs. Second,
GaAs MESFET process technology is fairly simple, requiring many fewer fabrication steps
than CMOS and offering the prospect of reasonable yields (and cost) for large designs. Other
factors which can affect cost include integration density, packaging, wafer cost, and design
time. Third, this technology provides a foundation for exploring a wide range of
high-performance issues which might otherwise be difficult for a university research project.


CHAPTER  3 


Architectural Issues for a
High Performance Floating-Point Unit


3.1  Previous  Work 

Previous work by the GaAs Microprocessor group at the University of Michigan
has focused on both the development of a CAD environment appropriate for GaAs design
and two initial implementations of the CPU architecture. The first version, called Aurora I,
was based on a simple 5-stage pipeline, with a 32-word register file and ALU. The chip
executed 30 instructions from the MIPS Instruction Set Architecture (ISA), consisted of
60,000 transistors, and operated at 100 MHz [Brown92a], [Brown93]. This chip served to
drive the selection and development of many of the tools needed to design and analyze
VLSI circuits in GaAs. The goal for the second generation of the CPU, Aurora II, was to
study issues in high speed microprocessor architectures, including support for caches and
exceptions. The chip implemented an additional 10 instructions, was comprised of 160,000
transistors (in the same area as Aurora I), and operated at 180 MHz [Upton93]. In addition,
timing analysis capability was added to the suite of design tools.

3.2 Aurora III System Overview

A block diagram for the Aurora III system is shown in Figure 3.1. The system is
comprised of four custom GaAs chips: three logic chips and a 32K-bit SRAM used for
building a 64 K-byte external data cache. The logic chips are the Integer Processing Unit
(IPU), the Floating-Point Unit (FPU), and the Memory Management Unit (MMU). The IPU
consists of five functional modules that operate semi-autonomously to fetch, decode,
execute and retire instructions.

Figure 3.1 Processor Block Diagram

The IPU is similar to the IBM-Motorola PowerPC 603 and 604 processors [Diefendorff94]
in that it includes a Bus Interface Unit (BIU), an Integer Execution Unit (IEU), an
Instruction Fetch Unit (IFU), and a Load Store Unit (LSU). In addition, the IPU has a
dedicated Prefetch Unit (PFU) for data and instructions. The BIU provides sustained transfer
rates of 1.5 G-bytes per second over a 32-bit bidirectional bus using a collision-based
protocol. The clock is sent along with the data, which allows transfers on both clock edges.
The IFU fetches instructions either from a partially decoded on-chip instruction cache or
from secondary memory via the BIU and MMU. An instruction miss will stall issue, but the
back end of the pipeline, including active data references in the LSU, can proceed. Static
branch prediction is currently supported and a dynamic scheme could be easily added to a
future version of the design. The IEU contains 2 copies of the ALU and register file, which
are symmetric in order to simplify issue determination. The LSU interfaces directly to a
3-cycle pipelined off-chip data cache and supports non-blocking load instructions via several
miss-status holding registers. As a result, instruction issue stalls for a load miss only if
a subsequent instruction needs the result of the load. Further, the LSU contains a 4-entry
coalescing write cache to reduce store traffic across the BIU. Since the MMU does not reside
on the same chip as the IPU, the tags of the write cache also act as a micro-TLB, allowing
store instructions to be retired quickly from the integer reorder buffer. Finally, the LSU is
responsible for transferring floating-point instructions and data to the FPU. Figure 3.2
shows a more detailed view of the FPU, the characteristics of which will be discussed in
greater detail below.
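
As a rough sanity check on the BIU figures, the clock rate implied by the quoted transfer rate can be worked out directly. The sketch below is only an estimate under stated assumptions: it takes "G-bytes" to mean 10^9 bytes and ignores any overhead from the collision-based protocol, neither of which is specified in the text.

```python
BUS_WIDTH_BITS = 32            # bus width, from the text
TRANSFERS_PER_CYCLE = 2        # data moves on both clock edges
SUSTAINED_BYTES_PER_S = 1.5e9  # assumes 1 G-byte = 10**9 bytes

# Bytes moved per clock cycle when both edges carry a full bus word.
bytes_per_cycle = BUS_WIDTH_BITS // 8 * TRANSFERS_PER_CYCLE

# Clock frequency that would deliver the quoted rate with no protocol overhead.
implied_clock_hz = SUSTAINED_BYTES_PER_S / bytes_per_cycle

print(f"{bytes_per_cycle} bytes/cycle -> {implied_clock_hz / 1e6:.1f} MHz")
```

Under these assumptions the bus moves 8 bytes per cycle, so the quoted rate corresponds to a bus clock of roughly 190 MHz.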

3.3  Simulation  Methodology 

The floating-point simulator developed for this study is built upon a modified
version of Mike Smith’s “xsim” trace-driven simulator [Smith87]. Changes were made to
represent the Aurora III architecture, including dual issue of instructions, prefetching of
instructions and data, an appropriate memory subsystem, and other characteristics
discussed in Section 3.2. Additions to the simulator were required to implement all FPU
functionality. Pixie, a program developed by MIPS Computer Systems Inc., is used to annotate
an application to be studied with assembly instructions that output information about
memory references and branches. The simulator executes the “pixiefied” object file and pipes
the output back to the analysis routines of the simulator. Accurate information about the
state of the machine is used to determine cycle and instruction counts, as well as specific
information about what causes stalls (full reorder buffer, result bus conflicts, data
dependencies, etc.). The SPECfp92 benchmarks are used to represent a typical scientific
workload. The suite comprises 14 applications, some of which are more vectorizable than
others (refer to Table 3.1). Figure 3.3 shows the dynamic instruction breakdown for each
of the benchmarks. These applications are written in either FORTRAN or C, and most use
either single or double precision numbers exclusively. Since many experiments were to be
run and each program executed adds to the runtime, subsets of the benchmarks were often
used; the rationale for choosing to use certain benchmarks will accompany the discussion
of the experiment. All experiments included at least 50 million instructions and some had
as many as 1 billion instructions. A larger set of sizes was used for the various IPU
resources and was held constant throughout all experiments, as summarized in Table 3.2.

Figure 3.2 Aurora III FPU Block Diagram

Table 3.1 SPECfp92 Benchmarks

Benchmark          Comments

alvinn             Neural network used for driving an automobile
                   - single precision
                   - C

doduc              Monte Carlo simulation of a nuclear reactor
                   - double precision
                   - non-vectorizable
                   - many subroutines
                   - FORTRAN

ear                Use of FFT’s to simulate the human ear
                   - double precision
                   - C

fpppp              Quantum chemistry program
                   - double precision
                   - difficult to vectorize
                   - FORTRAN

hydro2d            Astrophysics program to solve for galactical jets
                   - double precision
                   - vectorizable
                   - 48% of time spent in one subroutine
                   - FORTRAN

mdljdp2, mdljsp2   Solves equations of motion for 500 molecules
                   - double or single precision
                   - vectorizable
                   - FORTRAN

nasa7              Seven floating-point intensive tests
                   - double precision
                   - various matrix operations and radix-2 FFTs
                   - FORTRAN

ora                Ray tracing through spherical and planar optics
                   - double precision
                   - FORTRAN

spice2g6           Analog circuit simulator
                   - double precision
                   - causes high data cache miss rates
                   - FORTRAN

su2cor             Quantum physics calculation of elementary particle masses
                   - double precision
                   - vectorizable
                   - 52% of time spent in one subroutine
                   - FORTRAN

swm256             Solution of a system of shallow water equations using finite
                   difference approximations
                   - single precision
                   - vectorizable
                   - 48% of time spent in one subroutine
                   - FORTRAN

tomcatv            Analyzes geometric domains, such as airfoils and cars
                   - mixture of single and double precision
                   - highly vectorizable
                   - high data cache miss rates
                   - FORTRAN

wave5              Solution of Maxwell’s equations and particle equations of motion
                   - single precision
                   - FORTRAN

3.4  Evaluation  Criteria 


Table 3.2 IPU Resources Used for Simulation Experiments

Integer Reorder Buffer Entries          8
Number Outstanding Load References      4
Number Prefetch Buffers                 16
Write Cache Entries                     8
Primary Data Cache                      64K
Primary Instruction Cache               4K
Secondary Data Cache                    8M
Secondary Instruction Cache             8M
Data Cache Line Size                    32 bytes
Instruction Cache Line Size             32 bytes
Primary Miss Latency                    17 cycles
Secondary Miss Latency                  60 cycles

Figure 3.3 Dynamic Instruction Breakdown for SPECfp92

Three synchronization points in the design can cause the FPU to stall the IPU. Of
these stalls, the first occurs when the instruction (Iq) or load (Lq) data queue in the FPU
becomes full; this in turn depends on the various constraints that can prevent issue or
dispatch from the queues. (The distinction between issue and dispatch will be discussed
further in the section on issue policies.) The second stall source happens whenever a
branch-on-FPU instruction awaits the completion of a corresponding floating-point compare
instruction. Again, a full range of internal stall conditions can delay completion of the
compare instruction. The last type of stall occurs when the LSU write cache is full, and
eviction is prevented because one or more entries are waiting for floating-point store data.
Other entries may also be locked if an outstanding cache line has yet to be returned from the
secondary memory system. In both of these cases, loads and stores will stall issue to the LSU
whenever it is not possible to evict a write cache entry.
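
The three external stall sources can be summarized as a small classification routine. This is a minimal sketch, not the simulator's code: the `CycleState` fields, the priority order of the checks, and the definition of stalls per instruction as stalled cycles divided by instruction count are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CycleState:
    """Hypothetical per-cycle snapshot of the IPU/FPU interface."""
    iq_full: bool = False                    # FPU instruction queue (Iq) is full
    lq_full: bool = False                    # FPU load data queue (Lq) is full
    branch_waiting_on_compare: bool = False  # branch-on-FPU awaits a compare
    write_cache_full: bool = False
    write_cache_evictable: bool = True       # False if entries await store data or a cache line


def external_stall(state):
    """Return the name of the FPU-induced stall for this cycle, or None."""
    if state.iq_full or state.lq_full:
        return "queue_full"
    if state.branch_waiting_on_compare:
        return "branch_on_fpu"
    if state.write_cache_full and not state.write_cache_evictable:
        return "write_cache_full"
    return None


def stalls_per_instruction(trace, instruction_count):
    """Stalled cycles divided by instructions executed (assumed definition)."""
    stalled = sum(1 for state in trace if external_stall(state) is not None)
    return stalled / instruction_count


trace = [
    CycleState(),
    CycleState(iq_full=True),
    CycleState(branch_waiting_on_compare=True),
    CycleState(write_cache_full=True, write_cache_evictable=False),
]
print(stalls_per_instruction(trace, instruction_count=8))
```

A real simulator would also record which internal condition led to each stall; the sketch only shows how the three high-level categories might be counted.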


The  underlying  mechanisms  which  trigger  these  3  cases  will  be  discussed  in  detail 
in  the  analysis  that  follows.  To  facilitate  comparisons,  some  combination  of  the  following 
metrics  will  be  utilized: 


1. The average rate of transferring floating-point instructions to the FPU. The interface
between the IPU and FPU supports a maximum rate of 2 instructions per cycle.

2. The frequency of issuing or dispatching more than one floating-point instruction per
cycle. An instruction is considered to have issued when all source operands are available
and the instruction has been sent to a functional unit for execution. On the other
hand, dispatch occurs only for an out-of-order issue policy and refers to an instruction
that is transferred from the queue to the reservation station of a given functional
unit; this occurs only when a source operand that is needed by an instruction is not
yet available. When the data becomes available, the instruction will proceed to issue
from the reservation station. For in-order issue, the maximum issue rate will be limited
to 2, for reasons to be discussed below. For an out-of-order issue policy, it is
possible to have a peak issue equal to the number of functional units; maximum dispatch,
and hence the maximum average throughput, will also be limited to 2.

3. The 3 external stalls mentioned above: Iq/Lq full, branch-on-FPU, and write cache
full due to outstanding floating-point stores. These high-level stalls will be decomposed
into the various conditions that can prevent issue or dispatch. Floating-point
stalls per instruction (FSPI) is a metric which combines all three stalls and will
complement basic CPI.

4. Average latencies and analysis of the components that comprise these latencies. For
each floating-point instruction type this will include the average latency for issue
and/or dispatch (depending on the issue policy), for results becoming available, and
for results writing back to the floating-point register file. The dispatch and issue
latencies are measured from the time a floating-point instruction issues in the IPU,
and indicate how long it takes for the instruction to proceed through the ALU and
LSU and finally dispatch or issue from the floating-point instruction queue. Stall
sources within the IPU include: 1) waiting for a full integer reorder buffer
(floating-point instructions must also reserve an integer reorder buffer entry in order to
support precise memory exceptions), 2) waiting for access to the data cache busses, and
3) waiting for a full instruction or load data queue within the FPU. The latency for
results becoming available is measured to the time at which the result data is actually
written into the floating-point reorder buffer. The final latency, for writing results
back to the register file, is measured to the time at which an entry reaches the head
of the reorder buffer; this latency indicates the length of time that data resides in the
reorder buffer and reflects the number of entries ahead in the reorder buffer and how
long they stall write-back to the register file. For floating-point load instructions, the
average latency is an indication of how often these memory references hit in the data
cache. Some of the experiments will also discuss the average issue point, as opposed
to the average latency to issue. The former indicates the average time taken by an
instruction to reach the point in the instruction queue where issue is possible. The
difference between issue point and issue latency is an indication of how long issue
is delayed due to constraints such as data dependencies and a full reorder buffer. The
average latency metric will be valuable in tracking where different types of instructions
spend their time, from their issue in the IPU to completion in the FPU.

5. Utilization of the various busses shared by the IPU and FPU. Different events contend
for these busses, raising questions about whether a given bus organization can
become a bottleneck for performance.

6.  Dynamic  instruction  counts  of  individual  integer  and  floating-point  instruction  types. 
Among  other  uses,  this  information  on  the  frequency  of  instruction  types  can  help 
identify  ways  to  more  efficiently  allocate  resources. 

7. Optimal sizes of instruction, load- and store-queues, and reorder buffer. This can be
determined dynamically for each experiment and benchmark by using a large number
of entries and by keeping track of the number of entries that are actually utilized.
Since these sizes should be appropriate for peak floating-point activity, information
about the number of entries being utilized by a particular resource will be updated
only when this resource is accessed.

8. Resource cost. The register bit equivalent (RBE) model of Mulder [Mulder91] is
used to evaluate the resource cost of different microarchitectural features. This
model uses a normalized measure of area cost which is based on the size of a 1 bit
static latch. For GaAs DCFL, one static latch requires 16 transistors and corresponds
to an area of 3600 square microns. Since static RAM elements are denser, a single
cell is modeled as one half of a RBE. Additionally, the overhead associated with
sense-amplifiers and decoding logic is represented as a percentage of the array size.
The cost of various floating-point resources is shown in Table 3.3; these figures are
derived from actual layout obtained during chip design.
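
The RBE accounting described in the last item can be written down in a few lines. The sketch below is an illustration of the model as summarized in the text, not a reproduction of how the Table 3.3 figures were derived; the overhead fraction in the example is a hypothetical value, since the text gives no number for it, and the helper names are invented for this example.

```python
LATCH_AREA_UM2 = 3600   # area of one static latch (1 RBE) in GaAs DCFL, from the text
SRAM_CELL_RBE = 0.5     # one static RAM cell is modeled as half an RBE


def latch_array_rbe(words, bits):
    """Cost of a storage array built from static latches."""
    return words * bits


def sram_array_rbe(words, bits, overhead_fraction):
    """Cost of an SRAM array plus sense-amplifier and decoder overhead.

    overhead_fraction is expressed relative to the array size and is a
    hypothetical parameter here.
    """
    cells = words * bits * SRAM_CELL_RBE
    return cells * (1.0 + overhead_fraction)


def rbe_to_mm2(rbe):
    """Convert an RBE count to square millimetres."""
    return rbe * LATCH_AREA_UM2 / 1e6


latch_cost = latch_array_rbe(32, 64)
sram_cost = sram_array_rbe(32, 64, overhead_fraction=0.2)  # 0.2 is illustrative only
print(latch_cost, sram_cost, rbe_to_mm2(sram_cost))
```

The example ignores multiple ports and control logic, so its output should not be compared directly with the layout-derived entries in Table 3.3.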

Table 3.3 Resource Cost in RBE Units

FPU Element                                     Cost in RBE
1 Floating-point Register File (32x64)          4,700
1 Scoreboard                                    3,600
1 Instruction Queue Entry                       305
1 Load/Store Queue Entry                        220
1 Reorder Buffer Entry                          900
1 Reservation Station Entry (1/2 operands)      650/1030
Control Logic (25% overhead)                    8,000
1 Add Unit (1 to 5 cycles)                      5,000 to 1,250
1 Multiply Unit (1 to 5 cycles)                 8,750 to 4,375
1 Divide Unit (10 to 30 cycles)                 2,500 to 625
1 Conversion Unit (1 to 5 cycles)               2,500 to 1,250

What should be considered a meaningful improvement in performance? In light of
the fact that processor performance increases by about 0.8% per week and 50% per year,
any additional feature must at least keep pace. Including both design and verification time,
if a feature requires a month to implement, it ought to add greater than 4% to overall
performance. Further, if one considers the inherent inaccuracy of the various methods of
predicting performance, hopefully only on the order of a few percent, a reasonable criterion
might be 10%; below this amount, an idea may not be worth implementing. Clearly other
issues, such as the impact on area (yield = cost), speed, and power are equally important.
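
The relationship between the weekly and yearly growth figures can be checked by compounding. This is a small worked example rather than part of the original argument; it assumes the weekly gain compounds uniformly and takes a month to be one twelfth of a 52-week year.

```python
WEEKLY_GAIN = 0.008  # 0.8% per week, from the text


def compounded_gain(weeks, weekly_gain=WEEKLY_GAIN):
    """Fractional performance gain after compounding for the given number of weeks."""
    return (1.0 + weekly_gain) ** weeks - 1.0


annual = compounded_gain(52)
monthly = compounded_gain(52 / 12)

print(f"per year:  {annual:.1%}")
print(f"per month: {monthly:.1%}")
```

Compounding 0.8% per week gives a little over 50% per year and roughly 3.5% per month, consistent with the rounded figures used above as the bar a new feature must clear.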


3.5  Issue  Policies 

In an increasing order of performance gains, issue and completion policies are as
follows:

1) in-order issue, in-order completion (IOIO),

2) in-order issue, out-of-order completion (IOOO),

3) out-of-order issue, out-of-order completion (OOOO).

The first is the simplest, but achieves the worst architectural performance. In this
scheme, dependency checking needs to be done only between a decoded instruction and the
few instructions that are already in execution. Since results are completed in order, there is
no need for reordering prior to writing back to the register file. This simple policy stalls
instruction issue whenever an instruction needs a different functional unit than that used by
currently active instructions or when there is a conflict for a functional unit. The former
case is necessary since the functional units may have different latencies and this policy
requires instructions to complete in order. The latter case occurs in functional units that
require more than one cycle to complete and allow only one active instruction at a time.

The second policy (IOOO) will stall the decode unit only when there is a functional
unit conflict or a source operand has not yet been determined. A scoreboarding approach
can be used to detect dependencies and conflicts. This approach can run into output
dependency problems (an earlier result overwrites a later one), making it necessary to add a
mechanism for reordering results, such as a reorder buffer. Also, since multiple instructions
can complete at the same time, arbitration for result busses is needed. The reorder buffer is
also needed to prevent erroneously repeating the execution of an instruction upon returning
from an exception.
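
The issue conditions for the two in-order policies can be contrasted in a short sketch. This is a simplified illustration under stated assumptions, not the Aurora III issue logic: the function and parameter names are hypothetical, `pending_regs` stands in for a scoreboard of registers whose results are still outstanding, and `single_entry_units` names the multi-cycle units that accept only one instruction at a time.

```python
def operands_ready(srcs, pending_regs):
    """True when no source register is still waiting for a result."""
    return not any(reg in pending_regs for reg in srcs)


def unit_conflict(unit, active_units, single_entry_units):
    """True when the needed unit is busy and cannot accept another instruction."""
    return unit in active_units and unit in single_entry_units


def ioio_can_issue(unit, srcs, active_units, pending_regs, single_entry_units):
    """In-order issue, in-order completion."""
    # A different active unit may have a different latency, which could
    # let results complete out of order, so issue must wait.
    if any(active != unit for active in active_units):
        return False
    if unit_conflict(unit, active_units, single_entry_units):
        return False
    return operands_ready(srcs, pending_regs)


def iooo_can_issue(unit, srcs, active_units, pending_regs, single_entry_units):
    """In-order issue, out-of-order completion."""
    if unit_conflict(unit, active_units, single_entry_units):
        return False
    return operands_ready(srcs, pending_regs)


active_units = ["multiply"]
pending_regs = {"f4"}
single_entry_units = {"multiply", "divide"}
args = (active_units, pending_regs, single_entry_units)

print(ioio_can_issue("add", ["f0", "f2"], *args))   # blocked by the active multiply
print(iooo_can_issue("add", ["f0", "f2"], *args))   # free to issue
print(iooo_can_issue("add", ["f4", "f2"], *args))   # blocked on operand f4
```

The sketch omits the reorder buffer and result-bus arbitration that the second policy also requires.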

Out-of-order issue attempts to increase look-ahead capability by moving a
data-dependent instruction past the decode unit and into an instruction window which resides
between the decode and execute units. This issue policy increases the opportunity of finding
instructions without dependencies. Out-of-order issue is subject to an additional type of
data dependency, which occurs when a subsequent instruction changes a source operand of
a yet-to-be issued instruction (anti-dependency). To handle this, operands are forwarded to
the instruction window at the same time the instruction is sent to the window. However,
since operands may not all be ready, a mechanism is needed for forwarding results from the
execution units back to the instruction window. Out-of-order issue may also allow more
slip between the IPU and FPU by allowing instructions to be removed from the instruction
queue for three cases in addition to that of basic data-dependencies: 1) instructions which
need a busy non-pipelined functional unit, 2) both instructions in the issue pair need to use
the same functional unit, and 3) all result busses are busy. The apparent advantages of
OOOO may in fact not be realized if some of these events occur infrequently.
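
The three kinds of data dependency mentioned in this section (true, output, and anti) can be distinguished mechanically from the registers each instruction reads and writes. The sketch below is illustrative only; the dictionary-based instruction representation is a hypothetical convenience, and the read/write labels in parentheses are the commonly used names for these hazards rather than terms taken from the text.

```python
def classify_dependencies(earlier, later):
    """Return the data dependencies of `later` on `earlier` (in program order)."""
    deps = set()
    if earlier["dest"] in later["srcs"]:
        deps.add("true (read-after-write)")       # later reads what earlier writes
    if earlier["dest"] == later["dest"]:
        deps.add("output (write-after-write)")    # both write the same register
    if later["dest"] in earlier["srcs"]:
        deps.add("anti (write-after-read)")       # later overwrites a source of earlier
    return deps


# f2 = f0 * f1, followed by f0 = f2 + f3
mul = {"dest": "f2", "srcs": ["f0", "f1"]}
add = {"dest": "f0", "srcs": ["f2", "f3"]}

print(sorted(classify_dependencies(mul, add)))
```

In this example the add both consumes the multiply's result and overwrites one of its sources, so an out-of-order machine must capture f0 for the multiply before the add is allowed to write it.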

Reservation stations are an alternative to a monolithic instruction window and mean
that the central window is split, not necessarily evenly, among the functional units. The
registers at a reservation station can be organized in several ways. If more than one
instruction at a reservation station is ready to issue, a random approach would allow any one
to be selected. A simpler first-in-first-out (FIFO) ordering might be used, but with only a
small performance penalty since programs have a significant amount of inherent sequential
ordering. The main benefit of reservation stations is derived from the fact that instruction
level parallelism (considering both integer and floating-point instructions) could support a
peak issue rate of about 6 and an average of about 3 to 4 [Johnson91]. To ensure that short
term demand does not stall the decode and issue unit when using a centralized window, as many
as 12 source operand busses (2 per instruction) and 12 instruction window read ports might
be needed. Reservation stations, on the other hand, split the instruction window among the
execution units, reducing the number of global busses and instruction window read ports
required to a number closer to the average issue rate. Following are the trade-offs between
using a central instruction window and using reservation stations:

1. Interconnect area and register file ports versus storage space. Reservation stations
use more storage space, since they typically have more total entries than a unified
instruction window. This is offset somewhat by the fact that reservation station
entries need accommodate only the number of operands required by a given
functional unit. Also, reservation stations can result in less global interconnect: a
central window needs dedicated busses between the instruction window and each
functional unit, whereas with reservation stations fewer common tristate busses can
be used to direct operands from the register file and reorder buffer to the functional units.

2.  More  complex  issue  logic  versus  duplicated  issue  logic.  The  central  window  is  more 
complex  since: 

a.  It  selects  among  a  larger  number  of  instructions. 

b.  It  must  consider  all  functional  unit  conflicts  and  arbitrate  among  them. 

c.  It  must  be  able  to  issue  more  than  one  instruction  per  cycle. 

d. All instruction entries must be of maximum width (the instruction and 2 or 3
operands). Reservation stations allow the register width to be tailored to the
needs of each functional unit.

The additional state needed for the reservation station entries can be quite large.
Two entries per functional unit would contribute about 10,000 RBE’s, or an additional
25%, to overall chip area. To view this another way, the resources required to implement
OOOO are roughly equivalent to the difference between a 2-cycle pipelined and a 5-cycle
iterative multiply unit. This trade-off for the multiply unit represents a 10% difference in
total machine performance (see Section 3.8.1), setting a standard for cost/benefit for other
architectural features, such as issue policy. In addition to the impact on area from
out-of-order issue, the requirement to forward results can be expensive. This policy adds
complexity in the following ways:

1.  Each  functional  unit  needs  to  be  able  to  schedule  result  busses. 

2.  A  tristate  bypass  bus  is  required  for  each  load  queue  entry.  For  a  2-entry  load  queue, 
2  additional  64  bit  busses  would  need  to  be  routed  to  each  functional  unit.  Few  things 
are  more  expensive  in  terms  of  area  than  large  busses,  since  overcell  routing  across 
datapath  sections  tends  to  be  quite  congested.  When  the  available  routing  region 
over  a  datapath  cell  is  filled,  the  routing  spills  over  into  the  channels,  which  increases 
the  overall  pitch  of  that  datapath  section  and  consequently  the  size  of  the  entire  chip. 

3.  A  large  number  of  comparators  are  needed  within  each  reservation  station  entry  to 
check  the  result  tag  on  each  result  bus.  Simulations  discussed  below  conclude  that 
only  2  result  busses  are  needed.  Since  these  are  already  being  routed  throughout  the 
chip,  there  should  be  little  increase  in  interconnect  due  to  forwarding  of  results.  If 
the  number  of  reservation  station  entries  per  functional  unit  does  not  exceed  2,  four 
comparators  would  be  needed  per  functional  unit,  and  24  comparators  overall. 

A  top  critical  path  for  out-of-order  issue  may  now  consist  of  result  bus  arbitration, 
determination  of  source  operand  availability,  source  bus  arbitration,  and  functional  unit  ar¬ 
bitration. 


3.5.1  IOIO versus IOOO


A  baseline  architectural  model  which  uses  in-order  issue  and  in-order  completion 
policies  was  chosen,  and  various  design  alternatives  have  been  compared  against  this  base¬ 
line.  A  single  register  instruction  buffer  (IRB)  is  used  to  store  instructions  sent  by  the  IPU, 
allowing  a  small  amount  of  IPU  slip.  Since  results  must  complete  in  order,  issue  of  two  in¬ 
structions  to  separate  execution  units  is  not  allowed.  This  approach  ensures  that  a  later-is¬ 
sued  instruction  does  not  complete  before  an  earlier  issued  one.  Successive  instructions  can 
be  issued  to  the  same  functional  unit,  to  better  take  advantage  of  the  pipelined  nature  of 
most of the units; this form of IOIO is called “pipelined” in Table 3.4. The simulator
supports variable latencies for each unit, and allows any unit to be blocking (not pipelined).
For  example,  the  iterative  algorithms  used  for  division  make  the  divide  unit  blocking  in  na¬ 
ture,  since  only  one  divide  instruction  can  be  active  at  a  given  time.  A  register  scoreboard 
is  used  to  determine  data  dependencies.  The  baseline  clock  cycle  latencies  for  each  func¬ 
tional  unit  are:  load  =  1,  store  =  1,  add  =  3,  multiply  =  5,  divide  =  19,  conversion  =  2. 
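
As an illustration of how a register scoreboard resolves true data dependencies under in-order issue, the following is a simplified sketch using the baseline latencies listed above. It is not the simulator used in this study: the function name and instruction encoding are hypothetical, only one instruction issues per cycle, and blocking units, the in-order completion restriction, and memory effects are all ignored.

```python
# Baseline functional unit latencies in clock cycles, as listed in the text.
LATENCY = {"load": 1, "store": 1, "add": 3,
           "multiply": 5, "divide": 19, "conversion": 2}


def issue_cycles(program):
    """Return the cycle at which each instruction issues.

    program is a list of (unit, dest_reg, src_regs) tuples in program order.
    An instruction stalls until the scoreboard shows every source register
    is available; later instructions wait behind it (in-order issue).
    """
    ready_at = {}   # scoreboard: register -> cycle its value becomes available
    next_cycle = 0  # earliest cycle the next instruction may issue
    issued = []
    for unit, dest, srcs in program:
        start = max([next_cycle] + [ready_at.get(reg, 0) for reg in srcs])
        issued.append(start)
        if dest is not None:
            ready_at[dest] = start + LATENCY[unit]
        next_cycle = start + 1
    return issued


program = [
    ("multiply", "f2", ("f0", "f1")),
    ("add", "f3", ("f2", "f4")),   # must wait for the multiply result in f2
    ("add", "f5", ("f6", "f7")),   # independent, but issues in order
]
print(issue_cycles(program))
```

In this sketch the independent add is held behind the dependent one, which is the kind of stall that the out-of-order issue policies discussed earlier attempt to remove.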

Results  for  the  baseline  FPU  are  shown  in  Table  3.4.  Many  criteria  could  be  used 
for  measuring  the  effectiveness  of  an  architectural  feature;  parts  of  this  study  will  use  float¬ 
ing-point  stalls  per  instruction  (FSPI)  and  overall  cycles  per  instruction  (CPI).  The  former 
is  a  more  direct  indication  of  the  improvement  of  the  FPU  alone,  whereas  the  latter  tends 
to  show  the  impact  on  overall  performance.  Even  for  programs  where  floating-point  activ¬ 
ity  dominates,  the  majority  of  work  is  still  done  in  the  integer  unit,  so  fairly  large  reductions 
in  FSPI  are  needed  for  significant  improvements  in  CPI.  Figure  3.4  shows  the  ratio  of  inte¬ 
ger  to  floating-point  instructions  for  the  benchmarks. 

3.5.2  Dual  Transfer  and  Issue  of  Instructions 

The  design  of  the  IPU  for  transferring  floating-point  instructions  and  data  is  some¬ 
what  constrained  by  pin  limitations.  Pins  that  are  available  for  transfers  (without  adding 
busses  to  the  IPU)  are  the  data  cache  input  and  output  busses  (each  64  bits),  and  the  data 
cache tag bus (20 bits). These busses allow a potential transfer of two floating-point
instructions per cycle. The IPU cannot dual-issue a floating-point instruction and an integer
instruction which both reference the data cache.

Table 3.4  IOIO Baseline Performance

Benchmark             IOIO (not pipelined)     IOIO (pipelined)
(50M Instructions)    FSPI      CPI            FSPI      CPI

alvinn                0.000     1.5697         0.000     1.5695
doduc                 0.605     2.2460         0.374     1.8917
ear                   0.593     1.6657         0.291     1.3620
fpppp                 0.526     2.5826         0.290     1.6400
hydro2d               0.624     1.8912         0.480     1.6154
mdljdp2               0.840     1.8771         0.618     1.6464
mdljsp2               0.852     1.8650         0.686     1.7172
nasa7                 0.836     2.0442         0.621     1.7704
ora                   0.705     2.1715         0.232     2.0405
spice2g6              0.113     1.8247         0.246     1.8084
su2cor                0.646     2.1038         0.321     1.7100
swm256                0.927     2.1441         0.744     1.8062
tomcatv               0.901     2.5778         0.645     1.8724
wave5                 0.828     1.9666         0.405     1.8860
Avg Arithmetic        0.547     2.038          0.422     1.738
Avg Harmonic          0.643     1.999          0.425     1.722
% Change                                       -33.9     -13.9

Dual  issue  of  floating-point  instructions  in  the  FPU  is  an  indication  of  the  amount 
of  parallelism  available  within  the  instruction  stream.  Among  the  constraints  which  can 
prevent  dual  issue  are: 

1. functional unit conflicts,

2. true data dependencies,

3. blocking functional units that are busy,

4. result bus conflicts,

5. the instruction queue containing fewer than N instructions, where N is the
degree of multiple issue,

6. reorder buffer being full.

Figure 3.4  Ratio of Integer to Floating-Point Instructions for SPECfp92

As  the  second  and  fifth  columns  of  Table  3.5  show,  there  is  an  improvement  of  15% 
in  CPI  for  dual  transfers  and  dual  issue  of  floating-point  instructions  over  that  of  single 
transfers  and  single  issue.  Not  surprisingly,  the  middle  2  columns  of  this  table  show  little 
improvement  over  a  single  transfer  and  issue  policy,  since  either  the  peak  issue  rate  or  the 
peak  transfer  rate  is  limited  to  only  one  instruction  per  cycle.  Table  3.6  shows  a  fairly  broad 
range  for  how  often  dual  transfers  are  utilized.  Exactly  when  a  dual  transfer  occurs  and 
what  pair  of  instructions  is  involved  is  as  important  as  the  frequency;  this  will  be  developed 
later  in  the  discussion  of  instruction  sequences  which  contain  floating-point  compares. 
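
The percentage changes quoted from Table 3.5 can be reproduced from the harmonic-average CPI values in that table. The short sketch below does this; the helper name is hypothetical, and the inputs are the rounded averages as printed in the table.

```python
def percent_change(baseline, value):
    """Percentage change of value relative to baseline (negative = lower CPI)."""
    return 100.0 * (value - baseline) / baseline


# Harmonic-average CPI values as printed in Table 3.5.
single_transfer_single_issue = 1.499
single_transfer_dual_issue = 1.484
dual_transfer_dual_issue = 1.272

print(round(percent_change(single_transfer_single_issue, single_transfer_dual_issue), 1))
print(round(percent_change(single_transfer_single_issue, dual_transfer_dual_issue), 1))
```

Because the inputs are already rounded to three decimals, small discrepancies from the printed percentages are possible for the columns whose CPI is nearly identical to the baseline.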

Because of the additional complexity of issuing more than 2 instructions per cycle,
higher order issue was not examined. Based on the instruction traces for the benchmarks
listed, floating-point loads are the most common (24.0%), while addition/subtraction
instructions (11.3%) are about as likely as multiply instructions (9.6%). On the other hand,
divide and conversion instructions occur quite infrequently (0.7% and 3.9%, respectively).
The average issue rate for an IOOO policy is 1.26, as shown in Figure 3.5. This raised the
question of whether there is enough parallelism to support two add units. Simulation
suggests that the addition of a second add unit results in a negligible decrease in CPI (< 1%).
However, since the benchmarks are compiled for a machine with only one add unit (MIPS
R2000/R3000) it would be surprising to find many sequences which contain two successive
instructions that both use the same functional unit; consequently, the benefit of
implementing two add units should probably be examined further, in conjunction with code
reordering.

Table 3.5  Dual Transfers and Multiple Issue of 2 Instructions

                         Single Transfer   Dual Transfer   Single Transfer   Dual Transfer
Benchmark                Single Issue      Single Issue    Dual Issue        Dual Issue
(50M Instructions)       (CPI)             (CPI)           (CPI)             (CPI)

alvinn                   1.569             1.5686          1.5697            1.327
doduc                    1.655             1.6551          1.6421            1.431
ear                      1.228             1.2287          1.1706            1.005
hydro2d                  1.541             1.5406          1.5321            1.287
mdljdp2                  1.540             1.5396          1.5339            1.390
nasa7                    1.105             1.1058          1.1059            0.953
ora                      1.862             1.8624          1.8558            1.525
spice2g6                 1.841             1.8409          1.8296            1.624
su2cor                   1.509             1.5084          1.4949            1.231
Avg Harmonic             1.499             1.500           1.484             1.272
Avg Arithmetic           1.539             1.539           1.526             1.308
% Change from single
transfer, single issue                     0.0             -1.0              -15.1

Table 3.6  Dual Transfer Utilization

Benchmark            % Transfers Involving 2
                     Floating-Point Instructions

doduc                19.3
ear                  10.8
fpppp                21.7
hydro2d              14.9
nasa7                30.6
spice2g6              5.0
su2cor               21.0
swm256               31.2
tomcatv              35.5
Avg Harmonic         15.0
Avg Arithmetic       21.1
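
The tables in this section report both harmonic and arithmetic averages. As a check on how the two differ, the sketch below recomputes both for the dual transfer utilization figures of Table 3.6 using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import harmonic_mean, mean

# Percentage of transfers involving 2 floating-point instructions (Table 3.6).
dual_transfer_pct = {
    "doduc": 19.3, "ear": 10.8, "fpppp": 21.7, "hydro2d": 14.9, "nasa7": 30.6,
    "spice2g6": 5.0, "su2cor": 21.0, "swm256": 31.2, "tomcatv": 35.5,
}

values = list(dual_transfer_pct.values())
avg_harmonic = harmonic_mean(values)    # dominated by small values such as spice2g6
avg_arithmetic = mean(values)

print(f"Avg Harmonic   = {avg_harmonic:.1f}")
print(f"Avg Arithmetic = {avg_arithmetic:.1f}")
```

The harmonic average is pulled down by the one benchmark with very low utilization, which is why it is noticeably lower than the arithmetic average for this table.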


Multiple issue of two instructions incurs some hardware cost, including two
additional read ports on the register file and reorder buffer. However, a sense-amplifier based
register file is used in the FPU; this design should allow additional read ports without too
large a penalty in speed or area. The reorder buffer will have a small number of entries
so its performance will not suffer much from having two additional read ports. The
instruction queue also needs an additional read port to allow access to the instruction immediately
below the head of the queue. Additional source operand busses are needed, as well as
somewhat more complex control for instruction decoding and issue. Extra write tag ports are
required for the reorder buffer and also for the tag lookup table. Many of these issues will be
revisited in Section 3.11.1, which discusses the merits of a simpler design for the FPU.

Figure 3.5  Issue Degree for IOOO Policy
[Bar chart of issue degree per benchmark: doduc, ear, fpppp, hydro2d, nasa7, spice2g6,
su2cor, swm256, tomcatv. Avg Harmonic = 1.26, Avg Arithmetic = 1.27.]

3.5.3  IOOO versus OOOO

Out-of-order  issue  changes  the  issue  paradigm.  In  out-of-order  issue,  instructions 
are allowed to dispatch from the instruction queue to a reservation station for the
appropriate functional unit. The constraints for dispatch are now:

1. two, one, or no valid Iq entries,

2.  free  reservation  stations  for  the  required  functional  unit,  and 

3.  free  reorder  buffer  entries,  which  must  be  reserved  at  dispatch  in  order  to 
retain  the  in-order  sequence  of  the  program. 

To  limit  the  number  of  reorder  buffer  and  register  file  ports,  the  degree  of  dispatch 
is  limited  to  two.  Note  that  dispatch  can  occur  in  spite  of  conditions  which  would  stall  an 
IOOO policy, including the operands being unavailable, both instructions needing to use
the same functional unit, or the instruction needing a blocked functional unit. The actual
issue of an instruction now occurs within a reservation station and the constraints are:

1. a reservation station entry is valid, meaning that all necessary operands have
been forwarded to the entry,

2.  a  non-pipelined  functional  unit  is  not  blocked  with  a  prior  instruction,  and 

3.  a  result  bus  is  available  for  when  the  instruction  completes. 
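
The separation between dispatch and issue described above can be summarized as two independent checks. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, each check considers a single instruction, and the limit of two dispatches per cycle is left to the caller.

```python
def can_dispatch(iq_entry_valid, free_rs_entries, free_rob_entries):
    """Dispatch check for one instruction at the head of the instruction queue.

    Note that operand availability and functional unit status are not
    consulted: dispatch only needs a valid queue entry, a free reservation
    station entry at the required unit, and a free reorder buffer entry.
    """
    return iq_entry_valid and free_rs_entries > 0 and free_rob_entries > 0


def can_issue(operands_ready, unit_is_pipelined, unit_busy, result_bus_free):
    """Issue check for one reservation station entry."""
    # Only a non-pipelined (blocking) unit is blocked by a prior instruction.
    unit_blocked = (not unit_is_pipelined) and unit_busy
    return operands_ready and not unit_blocked and result_bus_free


# An instruction whose operands are not ready can still leave the queue...
print(can_dispatch(iq_entry_valid=True, free_rs_entries=1, free_rob_entries=3))
# ...but it cannot issue from the reservation station until they arrive.
print(can_issue(operands_ready=False, unit_is_pipelined=True,
                unit_busy=False, result_bus_free=True))
```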

Figure 3.6  Percentage of Dual Issue Instructions and Cycles
[Two bar-chart panels, Aurora III and DEC Alpha, showing for each benchmark the
Dual Issues (% Instructions) and Dual Cycles (% Time) on a scale of 0 to 70.]

An upper bound on the performance gained by using an OOOO policy for both
integer and floating-point instructions can be derived by considering both the percentage of
cycles in which a dual issue occurs and the percentage of instructions that are dual issues. Figure
3.6 shows these figures for an IOOO Aurora III model, as well as for a DEC 7000 AXP
system (which is not an out-of-order issue machine) [Cvetanovic94]. Assuming middle
range figures, such as for “fpppp” for the Alpha, dual issues occur 10% of the time and
account for 40% of all instructions. For 100 issues, this means there are 40 dual issues, 60 single
issues, and 400 cycles overall. Further, this means that there are 300 NOP cycles due to
stalling. For the best case scenario, if an OOOO policy turns all of these single issue cycles
into dual cycles, there would be a savings of 30 cycles, or 7.5%. This analysis assumes that
NOP stall cycles are relatively constant, being caused by memory system latencies.
Table 3.7 shows this upper bound across all of the benchmarks. The Alpha improvements
are quite low since very few cycles actually issue 2 instructions, suggesting long memory
latencies. The Aurora III estimates include dual issue of both integer and floating-point
instructions; however, an OOOO policy will be considered only for the issue of floating-point
instructions. Consequently, there will be fewer dual issue cycles and the performance
benefit will be lower than suggested by Table 3.7.
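
The upper-bound arithmetic above can be written out step by step. The sketch below follows the text's own accounting (per 100 issue cycles, with the dual issue percentage applied to issue events); the function name and the returned field names are hypothetical.

```python
def oooo_upper_bound(dual_cycle_pct, dual_issue_pct, issue_cycles=100):
    """Best-case cycle savings if every single issue cycle became a dual issue.

    dual_cycle_pct: percentage of all cycles in which a dual issue occurs.
    dual_issue_pct: percentage of issue cycles that are dual issues.
    """
    dual = issue_cycles * dual_issue_pct / 100    # dual issue cycles
    single = issue_cycles - dual                  # single issue cycles
    total = dual * 100 / dual_cycle_pct           # all cycles, including stalls
    nop = total - issue_cycles                    # stall (NOP) cycles, held constant
    saved = single / 2                            # pairing singles halves their cycles
    return {"dual": dual, "single": single, "total": total,
            "nop": nop, "saved": saved, "gain_pct": 100 * saved / total}


print(oooo_upper_bound(dual_cycle_pct=10, dual_issue_pct=40))
```

For the middle-range figures quoted in the text (10% and 40%), this reproduces the 400 total cycles, 300 NOP cycles, 30 saved cycles, and 7.5% gain.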

For this initial set of experiments, all resources were set to a maximal amount
(queues and reorder buffer having 20 entries and reservation stations having 15 entries), in
order to identify an upper bound on the performance difference between the two issue
policies. Some interesting results, evident from the initial experiments, are summarized in
Table 3.8 to Table 3.10. First, the OOOO policy actually has worse performance than the
IOOO policy for 3 of the 4 benchmarks. Examining the 3 high-level stall sources for the
FPU, it is clear that the most significant source involves branches that must wait for
floating-point compares. Each of the benchmarks, except for spice2g6, sees approximately one
additional cycle of latency before a compare instruction issues from a reservation station
(OOOO) versus issuing directly from the queue (IOOO). In modeling the OOOO policy,
an initial assumption was made that every instruction must first pass through a reservation
station prior to issue. In other words, even if all necessary operands are ready, the
instruction is first dispatched to the reservation station of the appropriate unit. This approach
introduces the unnecessary additional cycle of latency evident in Table 3.10. An obvious
solution would be to allow issue to proceed directly from the instruction queue if both
operands are valid. This adds a bit of complexity to the dispatch logic and also an additional
mux input for each operand to the input stage of each functional unit. Not all types of
instructions suffer from this additional cycle of latency, as seen in Table 3.11 for the
“hydro2d” benchmark. Still, it is probably more efficient to allow this fast bypass of
reservation stations for all instructions than to recognize only a few cases. The impact of
doing so is shown in Figure 3.7. The average drop in CPI across the benchmarks is a modest
1.2%, though some applications see as much as 5.4% improvement.

Table 3.7  Upper Bound for OOOO Performance Improvement

Benchmark             Aurora III    DEC 7000 Alpha
(50M Instructions)    (% gain)      (% gain)

alvinn                25.138         5.391
doduc                 23.575         6.121
ear                   33.396        11.077
fpppp                 35.277         6.476
hydro2d               25.276         5.283
mdljdp2               24.854         5.318
mdljsp2               22.532         7.091
nasa7                 33.364         6.457
ora                   23.342         8.121
spice2g6              21.319         6.395
su2cor                27.208         5.500
swm256                29.797        10.149
tomcatv               26.932         5.765
wave5                 21.864         3.536
Avg Harmonic          26.043         6.132
Avg Arithmetic        26.705         6.620

While an IOOO policy sees a delay from the time the reorder buffer is written to
when a result is retired, the effect is even worse for an OOOO policy, which explains in part
why the measured gains in performance for OOOO are quite small (or even negative).
Table 3.11 shows this same behavior for instructions in addition to floating-point
compares. However, only store instructions share the same constraint as compares in needing
to wait until they reach the head of the reorder buffer before their results can be used (the
reason is discussed in Section 4.5); other instructions can bypass their results in the same
cycle the reorder buffer is being written. Because more instructions have moved past the
instruction queue, more entries in the reorder buffer are necessary, as seen in Figure 3.8.
These instructions, and their corresponding latencies for completion, delay the time at
which compares and stores can be retired. Having more such entries precede a compare
simply makes the compare wait longer to complete. The ability to retire more than one
reorder buffer entry per cycle would help alleviate this problem, but might require an additional
register file write port. For GaAs sense-amplifier-based RAMs, write ports are expensive
compared to read ports. Some floating-point instructions, such as compares and stores, do
not produce results that are written to the register file, so it should be possible to allow two
reorder buffer entries to be retired if one or more are these types of instruction.
Consequently, an OOOO policy adds state not only in the form of reservation stations but also by
requiring a larger reorder buffer.

Table 3.8  Issue Policies (IOOO vs OOOO)

Benchmark             IOOO      OOOO (no fast issue)    OOOO (fast issue)
(50M Instructions)    (CPI)     (CPI)                   (CPI)

ear                   1.005     0.963                   0.951
fpppp                 1.024     1.030                   1.034
hydro2d               1.287     1.347                   1.308
spice2g6              1.624     1.644                   1.632
Avg Harmonic          1.189     1.1905                  1.178
Avg Arithmetic        1.235     1.2461                  1.231

Table 3.9  High Level FP Stall Sources

Benchmark                                              Write Cache
(50M Instructions)    Iq/Lq Full      bc1x Wait        Eviction

ear      (IOOO)       9.97             9.99            0.00
         (OOOO)       3.02            13.47            0.00
fpppp    (IOOO)       1.31             1.43            0.01
         (OOOO)       1.42             1.64            0.01
hydro2d  (IOOO)       1.90            21.92            0.00
         (OOOO)       0.70            35.77            0.00
spice2g6 (IOOO)       0.00            17.76            0.00
         (OOOO)       0.00            18.90            0.00

Table 3.10  Latencies for Floating-Point Compare Instructions

Benchmark
(50M Instructions)    avg latency    avg ROB ready    avg issue    avg dispatch

ear      (IOOO)        8.175          8.174            5.175
         (OOOO)       10.275          9.135            6.136       4.129
fpppp    (IOOO)       12.712         11.666            8.689
         (OOOO)       14.421         12.495            9.515       4.061
hydro2d  (IOOO)       11.032         10.396            7.396
         (OOOO)       12.656         11.057            8.057       4.353
spice2g6 (IOOO)       14.579         14.579           11.579
         (OOOO)       15.640         14.171           11.171       4.481

Figure 3.7  Comparison of IOOO and fast OOOO Policies
[Bar chart of CPI per benchmark. OOOO: Avg Harmonic = 1.122 (change = 1.2%),
Avg Arithmetic = 1.164 (change = 1.1%). IOOO: Avg Harmonic = 1.137, Avg Arithmetic = 1.175.]

Table 3.11  Avg Latencies of Various Floating-Point Instructions for Hydro2d

                     Instruction   % total        Avg        Avg ROB    Avg        Avg
Functional Unit      count         instructions   latency    ready      issue      dispatch

LOAD     (IOOO)      12920135      45.90          16.643     13.440     12.256
         (OOOO)                                   17.850      9.484      8.485     6.630
STORE    (IOOO)                                   16.496     13.099     11.972
         (OOOO)                                   17.707     10.263      9.325     7.135
ADD      (IOOO)      3124150       11.00          19.899     16.754     13.481
         (OOOO)                                   20.894     14.008     11.094     7.658
MULT     (IOOO)      2736150        9.70          19.210     14.900     12.752
         (OOOO)                                   19.869     12.663     10.745     6.953
LOADXTRA (IOOO)      2268717        8.60           8.932      8.315      7.174
         (OOOO)                                   10.694      7.788      6.791     4.800
COMPARE  (IOOO)      2268717        8.00          11.032     10.396      7.396
         (OOOO)                                   12.656     11.057      8.057     4.353
DIV      (IOOO)      434402         1.50          41.043     41.041
         (OOOO)                                   38.902     37.596     18.638     8.514
CONV     (IOOO)                     0.00           6.643      6.643      4.643
         (OOOO)                                    7.571      6.571      5.619     4.619

Figure 3.8  Reorder Buffer Entries Needed for IOOO and OOOO Policies
[Bar chart of the number of reorder buffer entries per benchmark. OOOO: Avg Harmonic = 8.08,
Avg Arithmetic = 11.45. IOOO: Avg Harmonic = 3.81, Avg Arithmetic = 7.16.]


Another effective approach to minimizing stalls due to floating-point branches
involves something external to the FPU, the memory system. Even for data cache hits, a 3
cycle latency is costly, underscoring the importance of a technology supporting large
on-chip memory structures. Some of this latency might be hidden by better compiler support.
The Aurora III architecture allows only one load to be issued per cycle. Dual issuing loads
might be justified, though doing so may require either a dual-ported or interleaved
first-level data cache. Using both hardware and software compiler approaches, a processor should
issue load instructions to the memory system as soon as possible. The above analysis
assumes a 17-cycle latency for a cache miss, which is probably optimistic. The overhead
associated with the Aurora III collision-based bus interface unit was found to add 6 to 7 cycles
to the overall latency. Collisions occurred more frequently than anticipated, which
increased the overall cost of handling a memory reference. Some of these benchmarks thrash
the data cache, as is evident in the average load latencies shown in Table 3.12. The latency
for a load which hits in the cache should be approximately 6 cycles (recall that these
latencies are measured from when the instruction issues), whereas for a cache and prefetch miss
the latency would be 20 cycles. When the average load latency is in the range of 8 to 11
cycles, as seen for most entries in the table, the majority of loads are hitting in the cache.
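
The claim that an average latency of 8 to 11 cycles implies mostly cache hits can be checked with a simple two-outcome model. This is an assumption made for illustration, not a model taken from the simulator: every load is treated as either a 6-cycle hit or a 20-cycle cache and prefetch miss, with no other contribution to latency. The function name is hypothetical.

```python
def hit_fraction(avg_latency, hit_latency=6.0, miss_latency=20.0):
    """Fraction of loads that hit, given the observed average load latency.

    Solves avg = h * hit_latency + (1 - h) * miss_latency for h.
    """
    return (miss_latency - avg_latency) / (miss_latency - hit_latency)


for avg in (8.0, 11.0):
    print(f"average latency {avg:4.1f} cycles -> hit fraction {hit_fraction(avg):.2f}")
```

Under this model both ends of the 8 to 11 cycle range correspond to a hit fraction above one half, which is consistent with the statement in the text.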
A final comparison of IOOO and OOOO policies contrasts the average issue rate of
each. For an IOOO policy, the peak issue rate is 2, whereas for an OOOO policy the peak
can be higher since issue is possible from the reservation stations at each functional unit.
The average issue rate for both policies cannot exceed 2, since the peak dispatch for OOOO
is 2. In order for one policy to surpass the other, its average issue rate must be higher.
Table 3.13 shows the issue rate as a function of total issues and number of issues per cycle.
The latter parameter indicates how often issue occurs, while the former makes the nature
of the issue evident. The number of issues per cycle is slightly higher for the OOOO policy;
however, the average issue rate per issue is lower. The product of the two is represented in the
third column; there is little if any difference between the two policies. This result correlates
well with the very small change in CPI that has been observed; an OOOO policy simply
does not succeed in increasing the number of dual issue cycles.
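
The relationship among the three columns of Table 3.13 can be verified directly. The sketch below uses a few rows from that table; the dictionary layout and function name are illustrative only, and agreement is expected only to within the rounding of the published two-decimal values.

```python
# (issue rate per issue, issues per cycle, issue rate per cycle) from Table 3.13.
TABLE_3_13 = {
    ("doduc", "IOOO"): (1.23, 0.34, 0.42),
    ("doduc", "OOOO"): (1.13, 0.37, 0.42),
    ("swm256", "IOOO"): (1.54, 0.55, 0.85),
    ("swm256", "OOOO"): (1.23, 0.71, 0.87),
    ("tomcatv", "IOOO"): (1.40, 0.59, 0.82),
    ("tomcatv", "OOOO"): (1.24, 0.67, 0.82),
}


def issue_rate_per_cycle(rate_per_issue, issues_per_cycle):
    """Average instructions issued per cycle: the product of the first two columns."""
    return rate_per_issue * issues_per_cycle


for (bench, policy), (per_issue, per_cycle, reported) in TABLE_3_13.items():
    computed = issue_rate_per_cycle(per_issue, per_cycle)
    print(f"{bench:8s} {policy}: computed {computed:.3f}, reported {reported:.2f}")
```

The OOOO rows issue more often but with fewer instructions per issue, so the products for the two policies end up nearly equal.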

The discussion up to this point has focused on the upper bound on performance
afforded by out-of-order issue and has assumed a large resource budget. Consequently, none
of the experiments showed a significant stall component due either to full queues or write
cache eviction. Only about half of the benchmarks were limited by floating-point branch
stalls. When the resources are reduced to a more reasonable level (Iq = 6, reorder buffer =
8, reservation station entries = 2), the effect on CPI for both policies is still small, as
summarized in Table 3.14. There is only a modest impact on performance due to limiting
resources; this is not surprising since the sizes of queues, reorder buffer, etc., were derived
from the simulations. The difference between the two issue policies is less than 3%. A
question arises about whether the stall profile for the baseline architecture is the same as for the
larger one, considering the change in CPI is small. Table 3.15 shows that the profiles are
similar, with some increases in queue and floating-point branch stalls for the baseline FPU.

Table 3.12  Breakdown of Average Load Latencies (IOOO Baseline)

                                      Avg Reorder
Benchmark             Avg Latency     Buffer Ready    Avg Issue    Avg Issue Point
(50M Instructions)    (cycles)        (cycles)        (cycles)     (cycles)

doduc                 11.744           7.956           6.901        5.897
ear                   11.242           9.136           7.992        7.377
fpppp                 11.130           6.542           5.425        4.541
hydro2d               13.890          10.779           9.623        7.866
nasa7                 16.397          10.889           9.761        8.505
spice2g6              14.371          13.367          12.271        9.116
su2cor                16.198          11.895          10.772        9.271
swm256                15.993          11.594          10.394        9.096
tomcatv               17.924          13.139          11.973       10.272

Table 3.13  Issue Rate for IOOO and OOOO Policies

Benchmark             Issue Rate per                        Issue Rate per
(50M Instructions)    Issue             Issues per Cycle    Cycle

doduc    (IOOO)       1.23              0.34                0.42
         (OOOO)       1.13              0.37                0.42
ear      (IOOO)       1.30              0.37                0.49
         (OOOO)       1.23              0.41                0.51
fpppp    (IOOO)       1.22              0.65                0.80
         (OOOO)       1.17              0.68                0.80
hydro2d  (IOOO)       1.20              0.36                0.44
         (OOOO)       1.12              0.38                0.42
nasa7    (IOOO)       1.19              0.78                0.93
         (OOOO)       1.09              0.85                0.92
spice2g6 (IOOO)       1.08              0.06                0.07
         (OOOO)       1.01              0.07                0.07
su2cor   (IOOO)       1.31              0.39                0.52
         (OOOO)       1.15              0.45                0.52
swm256   (IOOO)       1.54              0.55                0.85
         (OOOO)       1.23              0.71                0.87
tomcatv  (IOOO)       1.40              0.59                0.82
         (OOOO)       1.24              0.67                0.82

3.5.4  Reservation  Station  Selection  Policy 

For an OOOO policy, a choice must be made about how to select an instruction to issue
from a reservation station when more than one instruction is ready. A simple approach
would be to examine instructions in a first-in first-out (FIFO) ordering, which is easy to
implement via the head pointer of a queue. An alternative might involve choosing at random
one instruction from the pool of all ready instructions. This scheme may be slightly more
difficult to implement since all entries in a reservation station would need to be examined
concurrently. The logic would need to identify those instructions that are ready and then
select one instruction to issue. It is possible that this additional logic would contribute
adversely to the path length of overall instruction issue, though the actual number of
reservation stations needed per functional unit would be quite small. Table 3.16 summarizes
the results for both of these policies and shows that there is very little difference between
the two. In fact, for the 4 benchmarks shown, a FIFO prioritization yields slightly better
performance, though the difference is not significant. This result is consistent with intuition
because the instruction whose result is needed the most is the oldest active instruction, which
is the first instruction in the queue.

Table 3.14 Resource Allocation for IOOO and OOOO Policies

                  IOOO      IOOO                OOOO      OOOO
                  Large     Base                Large     Base
Benchmark         (CPI)     (CPI)     Speedup   (CPI)     (CPI)     Speedup
doduc             1.4309    1.4836    1.0368    1.4136    1.4767    1.0446
ear               1.0049    1.0204    1.0154    0.9507    0.9546    1.0041
fpppp             1.0236    1.0362    1.0123    1.0338    1.0920    1.0563
hydro2d           1.2866    1.3653    1.0612    1.3077    1.3538    1.0353
nasa7             0.9527    1.0368    1.0883    0.9516    1.0913    1.1469
spice2g6          1.6238    1.6464    1.0139    1.6322    1.6325    1.0002
su2cor            1.2309    1.2859    1.0447    1.2170    1.2986    1.0671
swm256            0.9404    1.0587    1.1259    0.8991    1.0477    1.1653
tomcatv           1.0811    1.2813    1.1852    1.0716    1.2916    1.2053
Avg Harmonic      1.137     1.212     1.062     1.122     1.215     1.076
Avg Arithmetic    1.175     1.246     1.065     1.164     1.249     1.081
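
The two selection policies of Section 3.5.4 can be sketched as follows. This is a minimal illustration, not the simulator used for Table 3.16: the `Entry` type and both function names are hypothetical, and FIFO is read here as "examine only the entry at the head pointer", which is one interpretation of the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int      # arrival order; a lower tag means an older instruction
    ready: bool   # True once all source operands are available


def select_fifo(station):
    """FIFO policy: only the head entry is examined; it issues if it is ready."""
    if station and station[0].ready:
        return station[0]
    return None


def select_random(station, rng=random):
    """Random policy: every entry is examined and one ready entry is chosen."""
    ready = [entry for entry in station if entry.ready]
    return rng.choice(ready) if ready else None


if __name__ == "__main__":
    station = [Entry(0, False), Entry(1, True), Entry(2, True)]
    print("FIFO  :", select_fifo(station))     # head is not ready, nothing issues
    print("random:", select_random(station))   # one of the two ready entries
```

The random policy needs the ready status of all entries at once, which is the extra selection logic the text refers to; the FIFO policy needs only the head entry.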
Table 3.15 Resource Allocation and High Level Stall Sources (IOOO Policy)

Benchmark             FSPC (total)   Iq/Lq Full    bc1x Wait     Write Cache Eviction
(50M Instructions)    (% cycles)     (% cycles)    (% cycles)    (% cycles)
doduc    (large)      22.5           2.6           19.9          0.0
         (base)       26.4           8.9           17.5          0.0
ear      (large)      20.0           10.0          10.0          0.0
         (base)       21.2           10.2          11.0          0.0
fpppp    (large)      3.7            1.3           1.4           0.9
         (base)       5.4            3.3           1.5           0.5
hydro2d  (large)      33.8           1.9           31.9          0.0
         (base)       40.1           7.8           32.2          0.0
nasa7    (large)      12.6           9.8           2.8           0.0
         (base)       23.4           20.8          2.6           0.0
spice2g6 (large)      17.8           0.0           17.8          0.0
         (base)       18.8           0.0           18.8          0.0
su2cor   (large)      12.4           3.2           8.3           0.9
         (base)       20.8           15.3          5.3           0.1
swm256   (large)      19.1           12.0          6.5           0.6
         (base)       32.0           28.5          3.3           0.2
tomcatv  (large)      10.8           1.9           8.9           0.0
         (base)       34.4           26.9          7.5           0.0

Table 3.16 Reservation Station Entry Selection Policy

Benchmark             FIFO      Random
(50M Instructions)    (CPI)     (CPI)     Speedup
ear                   0.9497    0.9507    1.0011
fpppp                 1.0341    1.0338    0.9997
hydro2d               1.3057    1.3077    1.0015
spice2g6              1.6325    1.6322    0.9998
Avg Harmonic          1.177     1.178     1.001
Avg Arithmetic        1.230     1.231     1.001
Table 3.17 Branch-on-FPU Stalls (IOOO Policy)

Benchmark             % of Total
(50M Instructions)    Cycles
doduc                 19.9
ear                   9.9
hydro2d               31.9
mdljdp2               41.8
mdljsp2               45.6
ora                   8.9
spice2g6              17.8
su2cor                8.3
swm256                6.5
tomcatv               8.9
wave5                 34.1

3.6  Improving  The  Latency  of  Floating-Point 
Compare  Instructions 

Table 3.17 and Table 3.18 show the importance of synchronization stalls due to
floating-point branches. Table 3.17 identifies the benchmarks which have the highest
percentage of compare instructions. Table 3.18 shows the average latencies for floating-point
compares in these benchmarks, including a breakdown of where the latency occurs (“mdljsp2”
is not included, since it has similar behavior to “mdljdp2”). The difference between
the average number of stall cycles per floating-point branch (column 2) and the average
latency for a compare instruction (column 3) can be explained by instructions which
intervene between the compare and branch; on average, 1.4 instructions fall between the
compare and branch across these five benchmarks. Latency to the issue point is an
indication of how significant data cache misses are for a particular benchmark; “spice2g6” has
both the highest issue point latency and the highest data cache miss rate. In the “wave5”
benchmark, load and compare issue are further delayed by prior instructions in the queue.
These latencies are quite large; reducing them is an important part of optimizing an FPU
design.

Table 3.18 Compare Latencies for High Branch-Stall Benchmarks

                      Avg Stall Cycles                             Avg Issue       Avg Issue Point
Benchmark             per Floating-      Avg        Avg ROB        (Issue takes    (Issue can take
(50M Instructions)    Point Branch       Latency    Ready          place)          place)
doduc                 8.97               11.155     9.671          6.691           4.726
hydro2d               9.70               12.015     10.378         7.378           6.272
mdljdp2               8.61               10.173     8.642          5.642           3.428
spice2g6              14.60              15.562     14.562         11.563          11.035
wave5                 10.56              14.608     13.281         10.281          7.408
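
The breakdown in Table 3.18 can be made explicit by differencing adjacent columns. The sketch below assumes that the four latency columns are cumulative times measured from the same starting point; that reading, and the component names, are assumptions made for illustration.

```python
# (avg latency, avg ROB ready, avg issue, avg issue point) from Table 3.18.
TABLE_3_18 = {
    "doduc":    (11.155,  9.671,  6.691,  4.726),
    "hydro2d":  (12.015, 10.378,  7.378,  6.272),
    "mdljdp2":  (10.173,  8.642,  5.642,  3.428),
    "spice2g6": (15.562, 14.562, 11.563, 11.035),
    "wave5":    (14.608, 13.281, 10.281,  7.408),
}


def breakdown(latency, rob_ready, issue, issue_point):
    """Split the average compare latency into four consecutive phases (cycles)."""
    return {
        "reach issue point": issue_point,
        "wait to issue": issue - issue_point,
        "execute": rob_ready - issue,
        "reorder buffer": latency - rob_ready,
    }


if __name__ == "__main__":
    for name, row in TABLE_3_18.items():
        parts = ", ".join(f"{k} {v:.2f}" for k, v in breakdown(*row).items())
        print(f"{name:9s} {parts}")
```

Under this reading the execute phase is about three cycles in every benchmark, and most of the remaining latency accumulates before the compare can issue.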

The Aurora III design allows only one floating-point compare to be outstanding at
a time to simplify the interface between the IPU and FPU. Allowing more than one compare
instruction to be active at a time might seem to offer the potential of reducing
branch-on-FPU stalls. However, a second compare is seldom encountered while a first one is
still active. This is explained by the fact that there is a one-to-one pairing of
branch-on-compare with the actual compare; issue of the branch will stall until the compare
is resolved.

Table 3.10 shows that about four cycles of latency are due to the time required for
the floating-point instruction to reach the instruction queue in the FPU. The IPU has
pipeline stages for register fetch (RF), ALU execution (ALU), and load-store unit execution
(LSU). Contention for the data cache bus can add more latency but this tends to happen
infrequently (see Section 3.7.4). Dispatch from the queue is also constrained by the three
conditions listed in Section 3.5.3; however, since in this set of experiments there are a large
number of reservation stations and reorder buffer entries, it is unlikely that dispatch will
stall for these reasons.

One could move the transfer of floating-point instructions up in time to the same
point at which issue occurs in the IPU, thereby saving the two cycles that correspond to the
ALU and LSU pipe stages. In this approach, one would have to add dedicated busses
between the IPU and FPU for instructions and data (three 64-bit busses) to avoid stalling the
front end of the machine. The current Aurora III design decouples the front end of the IPU
pipeline (IC, RF, ALU) from the longer-latency back end (LSU). When a floating-point
instruction transfer is stalled while a higher priority event takes place (such as a cache fill),
the issue of integer instructions is not inhibited. As will be discussed later, a desire to limit
pin count was a key factor in choosing the current organization. However, it is not clear that
moving the transfer of floating-point instructions forward in time would necessarily result
in lower latencies for compares. The merit of an early transfer point depends upon how
often a floating-point compare is preceded by a floating-point load. A load instruction must
still pass through the RF and ALU stages, and will see a three-cycle latency for data to be
returned from the off-chip pipelined data cache (assuming a cache hit). If this pattern occurs
often, the latency for compares will be roughly the same as in the current system, regardless
of an early transfer. For the benchmarks which are limited by floating-point branches,
examining the characteristics of the most commonly occurring compares reveals the
following sequences:

1:
    lwc1   $f4
    lwc1   $f5         <= $f4 and $f5 comprise a single double-precision register
    integer op
    mov.d  $f0,$f4     or   sub.d $f0,$f2,$f4
    cmp    $f0,$f2
    XX
    bc1x

2a:
    lwc1   $f4
    lwc1   $f5
    integer op
    cmp    $f4,$f10
    XX
    bc1x

2b:
    lwc1   $f4
    lwc1   $f5
    integer op
    integer op
    cmp    $f4,$f10
    XX
    bc1x

3:
    integer op
    cmp    $f8,$f2
    integer op
    bc1x

Table 3.19 Common Compare Instruction Sequences

              Compare     Total
Benchmark     Sequence    Compares
doduc         1           44.5
hydro2d       1           18.0
              2A          24.0
              3           24.0
mdljdp2       1           37.8
              3           37.8
wave5         2A          40.0
              2B          21.0
              3           29.0
spice2g6      2A          94.5

Table 3.19 shows which sequences are prevalent for the five benchmarks in question.
The first important observation about these sequences is that they fall into three
categories. Sequences in group 1 have two intervening instructions between the load and
compare. One of these instructions is a floating-point operation, the source of which is the
load and the result of which is used by the compare. Group 2 sequences have either one or
two intervening integer instructions. The group 3 sequences are characterized by the fact
that the operands needed by the compares are readily available and no issue stalls occur due
to data dependencies. For all groups, on average 1.08 instructions fall between the load and
compare.

Figure 3.9 Timing for Aurora III Memory System and Early Instr. Transfer
(Pipeline timing diagram for Memory System #1, 3 cycle latency: a lwc1/nop, cmp/bc1x
sequence shown for FPU #1 (Aurora III) and FPU #2 (Early Transfer); horizontal axis: time.)


Note that the sequence of a load followed immediately by a compare does not
appear; this case cannot occur in the MIPS architecture because of the load delay slot. Figure
3.9 contrasts the way a load-compare sequence progresses through the current Aurora III
architecture and a design which moves floating-point transfers forward in time. The compare
cannot be issued in either case until the fifth cycle; the early transfer design holds the
compare in the instruction queue longer. Having more instructions between the load and
compare delays the arrival of the compare at the instruction queue, and increases the chance of
benefitting from an earlier instruction transfer. With two intervening instructions in an
early transfer design, the compare reaches the queue one cycle before the load issues. In the
current design, the compare would reach the queue one cycle after the load issues, thus
costing one cycle of latency. With three or more intervening instructions, the penalty is the
two cycles corresponding to the instruction passing through the ALU and LSU pipe stages.

Figure 3.10 Timing for 1-Cycle Memory System and Early Instr. Transfer
(Pipeline timing diagram for a memory system with 1 cycle latency: a lwc1/nop, cmp/bc1x
sequence shown for FPU #1 (Aurora III) and the early transfer design; horizontal axis: time.)

Compare latency can be improved in several ways in order to reduce branch-on-FPU
stalls. Much of the above discussion depends on the Aurora III architecture and a
three-cycle latency for a cache hit. A technology which supports a large on-chip single-cycle
cache would benefit more from an earlier transfer of floating-point instructions (see
Figure 3.10). Compilers could also improve the performance of early floating-point instruction
transfer by trying to schedule more instructions between the load and compare to better fit
the parameters which affect FPU stall cycles. Table 3.20 summarizes the results for the three
ideas of early instruction transfer, faster primary data cache, and better code scheduling.
For single issue, less distance is needed between the load and compare to benefit from an
early transfer point. As the table shows, a faster memory system is also a requirement for
achieving the greatest benefit of an early transfer. A dual issue processor needs more useful
instructions to be scheduled between the load and compare to have the desired effect,
because it retires them in pairs. An early transfer point results in the best performance, but
only if supported by both better code scheduling and a faster memory system. These
options also impact the number of entries needed for the instruction queue, as shown in
Table 3.21. In these cases, the early transfer policy fills the queue faster than instructions
can be issued from it. The lower right design points are the best in each of the 4
configurations for memory system and issue degree.

Table 3.20 Branch-on-FPU Stall Cycles for Different Organizations

                                          Single Issue           Dual Issue
                                          Interval (instr.)      Interval (instr.)
Memory System      FPU Configuration      1     2     3          1     2     3     4
3 Cycle Latency    #1 (Aurora III)        3     2     2          5     4     4     3
                   #2 (Fast Transfer)     3     1     1          5     4     4     3
1 Cycle Latency    #1 (Aurora III)        2     2     2          3     2     3     2
                   #2 (Fast Transfer)     1     0     0          3     2     2     1
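
The interaction between early transfer, memory latency, and scheduling distance in Table 3.20 is easier to see as a difference between the two FPU configurations. The sketch below only re-tabulates the published stall counts; the dictionary layout and names are illustrative.

```python
# Branch-on-FPU stall cycles from Table 3.20, indexed by (memory system, issue width).
# Each tuple lists the stall cycles for load-to-compare intervals of 1, 2, 3 (and 4).
STALLS = {
    ("3-cycle", "single"): {"aurora": (3, 2, 2),    "fast": (3, 1, 1)},
    ("3-cycle", "dual"):   {"aurora": (5, 4, 4, 3), "fast": (5, 4, 4, 3)},
    ("1-cycle", "single"): {"aurora": (2, 2, 2),    "fast": (1, 0, 0)},
    ("1-cycle", "dual"):   {"aurora": (3, 2, 3, 2), "fast": (3, 2, 2, 1)},
}


def cycles_saved(config):
    """Stall cycles removed by the fast-transfer FPU, per scheduling interval."""
    row = STALLS[config]
    return tuple(a - f for a, f in zip(row["aurora"], row["fast"]))


if __name__ == "__main__":
    for config in STALLS:
        print(config, cycles_saved(config))
```

With a three-cycle memory system and dual issue the early transfer saves nothing at any interval, while with a one-cycle memory system and single issue it saves up to two cycles, which is the dependence on memory speed and scheduling described above.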

Table 3.21 Average Instruction Queue Entries for Different Organizations

                          Single Issue            Dual Issue
                          Interval (instr.)       Interval (instr.)
FPU Configuration         1      2      3         1      2      3      4
#1 (Aurora III)           0.7    0.6    0.5       0.9    0.9    0.7    0.7
#2 (Fast Transfer)        1.3    1.0    1.0       1.4    1.4    1.3    1.3
#1 (Aurora III)           0.3    0.3    0.3       0.4    0.4    0.3    0.3
#2 (Fast Transfer)        1.0    0.8    0.7       1.2    1.2    1.0    1.0

Figure 3.11 Upper Bound on Performance via Improved Compare Latency
(Bar chart; axis from 0 to 1.8. Legend:)
    Base:                  Avg Harmonic = 1.433; Avg Arithmetic = 1.444; SPECfp92 = 303.8
    3-cycle compare unit:  Avg Harmonic = 1.208 (change 15.7%); Avg Arithmetic = 1.219
                           (change 15.6%); SPECfp92 = 336.7 (change 10.8%)
    2-cycle compare unit:  Avg Harmonic = 1.169 (change 18.4%); Avg Arithmetic = 1.184
                           (change 18.0%); SPECfp92 = 342.8 (change 12.8%)

An optimistic upper bound on the performance gained by these changes would
assume that compare latency is dominated by the speed of the add unit and not by the delay
involved in transferring the load data and compare instruction to the FPU. Figure 3.11
shows the results if compare latency is reduced from the 8+ cycles discussed above to either
the two or three cycles that correspond to the latency of the add unit. The faster add unit
offers only a 2% improvement in performance, a result which will be discussed further in
Section 3.8.1. Several factors tend to reduce the benefits of improved compare latency:

1)  Not  all  compare  instructions  are  preceded  by  a  load.  These  instructions  may  suffer 
similar  or  worse  latencies  due  to  data  dependencies.  However,  the  data  of  Table  3.19 
suggest  that  these  occurrences  are  fairly  infrequent. 

2)  Data  cache  misses  will  increase  the  latency  of  compare  instructions  beyond  that 
assumed  for  this  upper  bound.  Most  SPEC  benchmarks  have  relatively  small  data 
sets  and  experience  good  cache  hit  rates;  “spice2g6”  is  a  notable  exception. 

3)  A  compiler  will  not  be  able  to  schedule  an  arbitrary  number  of  useful  instructions 
between  the  branch,  compare,  and  load  instructions.  As  discussed  above,  smaller 
intervals  between  these  instructions  will  increase  branch-on-FPU  stall  cycles. 

4)  Some latency will always be associated with the compare instruction entering and
exiting the reorder buffer. Support for precise memory exceptions requires that the
compare be retired only when it reaches the head of the reorder buffer. This
additional latency is on the order of 1.5 cycles.
Which components of compare latency are due to the inner workings of the FPU?
Table 3.18 shows that in several benchmarks the result of a compare spends time in the
reorder buffer before being committed to the status register. This is caused by other entries
that precede the compare having not yet received results from the functional units. This
delay can add one or more cycles to the overall latency of a compare, raising the question
of whether it is necessary to wait until a compare reaches the head of the reorder buffer
before the compare is retired. Consider the following sequence:


    1. c.eq        <= a floating-point compare
    2. lw
    3. bc1x        <= branch on compare being true/false
    4. c.eq
    5. nop

The first compare (1) completes and, upon exiting the floating-point add unit, writes
the status register with a condition equal to one. The branch is then taken, followed by the
second compare which evaluates to zero. At this point, the MMU determines that the load
(2) causes a page fault. The integer reorder buffer entry corresponding to the load may have
reached the head of the integer reorder buffer, but will not be committed to the status
register due to the exception. All instructions which follow the load are squashed, the page
fault is serviced and execution resumes at the load. However, this is the second time the
branch is encountered and the condition in the floating-point status register is now zero, not
one. Consequently, the wrong path is selected for the branch. On the other hand, waiting
until the compare reaches the head of the floating-point reorder buffer will ensure that this
error doesn’t occur, since there will also be an entry in this reorder buffer for the integer
load (see Section 4.5). As in the integer reorder buffer, this load entry will reach the head
of the floating-point reorder buffer and stall until the IPU sends the FPU a signal indicating
that the load cannot cause an exception. Though it is impractical to remove reorder buffer
latencies from compares, as has been shown, it is interesting to see how much effect
removing this latency might have on overall performance. An estimate for the upper bound on
savings is:

$$ \frac{\text{cycles}_{\text{total}} - (\text{bc1x instructions that stall}) \times \dfrac{\text{reorder buffer cycles}}{\text{bc1x instruction}}}{\text{instructions}_{\text{total}}} = \mathrm{CPI} $$

Table 3.22 Removing Reorder Buffer Latency for Compares

                    Baseline    No ROB Latency
Benchmark           (CPI)       (CPI)
doduc               1.484       1.446
hydro2d             1.020       1.293
mdljdp2             1.420       1.356
mdljsp2             1.480       1.410
spice2g6            1.646       1.625
su2cor              1.286       1.277
tomcatv             1.281       1.273
wave5               1.589       1.538
Avg Harmonic        1.372       1.383
Avg Arithmetic      1.401       1.392
SPECfp92            303.8       308.9
% Change from
Baseline                        1.7

On average across the benchmarks that experience branch-on-FPU stalls, the
lifetime for a compare in the reorder buffer is 1.18 cycles. Table 3.22 shows that this reorder
buffer latency has a minimal effect on overall performance.
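
The size of this effect can be illustrated numerically. The sketch below reads the upper-bound estimate as removing the average reorder buffer lifetime (1.18 cycles) once for every floating-point branch that stalls; that reading is an assumption. The cycle and branch counts in the example are hypothetical and chosen only to give a baseline CPI near that of doduc; they are not measurements from the study.

```python
def cpi_without_rob_latency(cycles, instructions, stalled_branches, rob_latency=1.18):
    """Upper-bound CPI if each stalling bc1x no longer waits in the reorder buffer."""
    return (cycles - stalled_branches * rob_latency) / instructions


def percent_change(baseline, improved):
    """Relative improvement of a rating such as SPECfp92, in percent."""
    return 100.0 * (improved - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical run: 50M instructions at a baseline CPI of 1.484,
    # with 1.6M floating-point branches that stall on a compare.
    new_cpi = cpi_without_rob_latency(74_200_000, 50_000_000, 1_600_000)
    print(f"CPI without ROB latency: {new_cpi:.3f}")
    # SPECfp92 values from Table 3.22.
    print(f"SPECfp92 change: {percent_change(303.8, 308.9):.1f}%")
```

The second figure reproduces the 1.7% change reported in the last row of Table 3.22.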

3.7  Memory  System  Issues 

This section focuses on improvements for the memory system to better support the
execution of floating-point code, including the bandwidth to the primary data cache, the use
of prefetching to offset the absence of an on-chip data cache, the use of split integer and
floating-point caches, and the organization of the IPU-FPU interface.
3.7.1 Double-word Loads and Stores

Currently, most applications that utilize floating-point numbers use the
double-precision format, which requires two loads or stores per operand. Since loads occur
with a 25% frequency and stores happen 9% of the time, a wider path to the memory system
would seem to offer significant performance gains. However, since both words of most
double-precision operands hit in the data cache, the pipelined memory system of the Aurora III
architecture means that the second reference delays issue by only one cycle (perhaps a bit
more due to bus contention). In the somewhat unlikely event that the operands are in
separate cache lines, a cache miss for the second word could impose many more than just one
additional cycle of latency. Assuming the former case, an upper bound for the performance
gained by supporting double-word references is:

$$ \mathrm{CPI} = \frac{-(\text{number of single word loads})/2 + \text{cycles}_{\text{total}}}{\text{instructions}_{\text{total}}} $$

This relationship assumes that load latency directly equates to IPU-FPU
synchronization stalls, which for an upper bound is not an unreasonable assumption since
we’ve seen that floating-point branches often wait for compares which are dependent on data
from a floating-point load. If the number of loads is reduced to match the number of
floating-point branches that stall while waiting for a compare, a new relationship is:

$$ \mathrm{CPI} = \frac{-(\text{floating-point branches that stall}) + \text{cycles}_{\text{total}}}{\text{instructions}_{\text{total}}} $$

In practice, the majority of comparisons are generated by a few sections of code.
The actual performance benefit for double-word loads may lie between these two
relationships. Figure 3.12 gives these bounds for all of the benchmarks. The benefit of
double-word stores will be low because they occur infrequently; it is also rare to see
synchronization stalls due to an inability to evict floating-point entries in the write cache
(refer to Table 3.15). To confirm these estimates, the benchmarks should be recompiled to take
advantage of double-word loads and stores. This will require the use of a newer version of
Pixie, which is unavailable at the present time.
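
The two relationships of this section bracket the benefit of double-word loads and can be evaluated side by side. The sketch below is illustrative only: the function name is hypothetical and the counts in the example are invented round numbers (50M instructions, a 25% load frequency), not data from the simulations.

```python
def double_word_load_bounds(cycles, instructions, single_word_loads, stalled_fp_branches):
    """Return (optimistic, conservative) CPI estimates for double-word loads.

    optimistic:   one cycle is removed for half of all single word loads
    conservative: one cycle is removed for each floating-point branch that stalls
    """
    optimistic = (cycles - single_word_loads / 2) / instructions
    conservative = (cycles - stalled_fp_branches) / instructions
    return optimistic, conservative


if __name__ == "__main__":
    # Hypothetical counts for a 50M-instruction run with a baseline CPI of 1.30.
    low, high = double_word_load_bounds(
        cycles=65_000_000,
        instructions=50_000_000,
        single_word_loads=12_500_000,    # 25% of instructions
        stalled_fp_branches=1_000_000,
    )
    print(f"CPI between {low:.3f} and {high:.3f} (baseline 1.300)")
```

The actual benefit is expected to fall between the two values, closer to the conservative one when only a few code sections generate the stalling compares.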




Figure 3.12 Performance Improvement via Double-Word Load Instructions
(Bar chart of CPI for each benchmark. Legend:)
    Base:                                    Avg Harmonic = 1.291; Avg Arithmetic = 1.326;
                                             SPECfp92 = 303.8
    Remove 1 cycle for each FP branch:       Avg Harmonic = 1.272 (change 1.5%); Avg Arithmetic
                                             = 1.303 (change 1.7%); SPECfp92 = 308.7 (change 1.6%)
    Remove 1 cycle for half of all FP loads: Avg Harmonic = 1.162 (change 10.0%); Avg Arithmetic
                                             = 1.216 (change 8.3%); SPECfp92 = 334.2 (change 10.0%)


3.7.2  Prefetching  of  Data 


Prefetching floating-point data serves to offset not having a large on-chip primary
data cache and can result in a significant increase in performance. The overall improvement
in CPI is 14.4% and some applications see a gain of as much as 60%. These results are
based on a scheme that prefetches only with a unity stride, and examines only the top entry
of the prefetch buffer; both of these constraints might warrant reexamination. Much work
has been done elsewhere concerning hardware and software prefetching and can be
referenced for these issues [Chen94], [Fu92], [Klaiber91], [Jouppi90].
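
The two constraints named above (unit stride only, top entry only) can be illustrated with a small model. This is a sketch under stated assumptions, not the prefetcher that was simulated: the class name, the buffer depth of four, and the 32-byte line size are all hypothetical, and the structure merely resembles a sequential prefetch buffer.

```python
from collections import deque


class UnitStridePrefetcher:
    """Sequential prefetch buffer in which only the top (oldest) entry is compared."""

    def __init__(self, depth=4, line_bytes=32):
        self.depth = depth
        self.line_bytes = line_bytes
        self.buffer = deque()   # prefetched line addresses, top entry first

    def on_cache_miss(self, addr):
        """Return True if the missing line is supplied by the top buffer entry."""
        line = addr - addr % self.line_bytes
        if self.buffer and self.buffer[0] == line:
            self.buffer.popleft()
            hit = True
        else:
            self.buffer.clear()   # not the next sequential line: restart the stream
            hit = False
        # Keep the buffer filled with the lines that follow, one line apart.
        next_line = (self.buffer[-1] if self.buffer else line) + self.line_bytes
        while len(self.buffer) < self.depth:
            self.buffer.append(next_line)
            next_line += self.line_bytes
        return hit


if __name__ == "__main__":
    prefetcher = UnitStridePrefetcher()
    for addr in (0, 32, 64, 128):
        print(addr, prefetcher.on_cache_miss(addr))
```

The last access misses even though line 128 was already in the buffer, because only the top entry is examined; relaxing that constraint, or detecting non-unit strides, is the kind of reexamination suggested above.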


3.7.3 Improving Cache Performance for Floating-Point Code


Since floating-point data access frequency can be 20-30% of total accesses,
floating-point memory references can contaminate integer data in the data cache. Integer and
floating-point data may be located in the same cache line, and a floating-point miss may
flush needed integer data. The use of a split data cache would avoid this problem, but would
require much additional overhead circuitry. The FPU would need logic to generate
addresses and implement a cache coherency policy. An internal report from Princeton [Wolfe92]
explores this idea in a study based on a modified version of Mike Johnson’s Match
simulator. The integer processor contained parallel execution units for loads, stores, branches,
and ALU operations. The FPU contained units for loads, stores, adds, conversions,
multiplies, and divides. Reservation stations of 4 entries were used on all units except for
division, which had 2 entries. The I-cache was 16K, 4 words/line, direct-mapped, with a
12-cycle miss penalty. The base system had a unified data cache of 64K, having 8-way
associativity, 8 words/line, and a 16-cycle miss penalty. The split approach used a 16K integer
data cache and a 64K floating-point data cache. A snoopy coherence policy was used for
main memory; tags for both data caches were compared simultaneously in order to resolve
store hits. The instruction fetch width, and corresponding number of decode units, was
found to be four (the peak parallelism for instruction issue was five). Eight of the ten
SPEC89 benchmarks were used (espresso and gcc were not used, due to disk constraints)
and no more than 8 million instructions (without sampling) were run for any one program
(possibly limiting the accuracy of the results). Floating-point performance in this simulator
was almost identical for the split cache and the 64K unified cache. Integer
performance dropped by 25-30% because of the reduced size of the integer data cache. In
addition, the amount of invalidation between the two caches (a measure of the physical
locality of integer and floating-point data) was only a few percent. Even if these invalidates
were totally eliminated, the overall gain would be minimal. Similar performance could be
obtained by simply increasing the size of the unified cache, without the extra hardware
expense involved with splitting the caches.

A second idea for reducing cache contention between floating-point and integer
data is not caching selected floating-point references. Floating-point intensive programs
often stream through matrix/vector data; if the lifetime of this data is short, caching it may
not make sense. Dynamic stride-detection and data prefetching could be used to recognize
these cases and reduce their latency. This approach is more practical for a technology like
GaAs which cannot support large on-chip caches because it reduces the occurrence of data
being purged from the first-level cache. However, the Princeton split-cache study suggests
that contention between integer and floating-point data may actually occur infrequently.
This is another case where good compiler support could improve performance by
organizing matrix data to increase its lifetime.
3.7.4  Interface  between  IPU  and  FPU 

As  mentioned  before,  the  existing  external  IPU  busses  could  be  used  in  several 
different  ways  to  handle  the  transfer  of  floating-point  instructions  and  data.  The  simplest, 
and the one implemented for the FPU, uses only the two data cache busses, dcIn and
dcOut (refer to Figure 3.1). The dcIn bus is used by the load-store unit to send data to the
primary data cache and the dcOut bus is used to receive data from the data cache. From the
perspective of the IPU, these busses are unidirectional. However, from the FPU
perspective, both are bidirectional: instructions and data can be sent to the FPU via dcIn,
and the FPU can send data to the data cache via dcIn; the FPU can receive data from the
data  cache  via  dcOut,  and  the  FPU  can  send  data  to  the  IPU  also  via  dcOut.  Recall  that 
there  are  3  types  of  floating-point  queues:  instruction,  load  data,  and  store  data.  For  a  load 
instruction  to  be  transferred,  there  must  be  a  free  entry  in  both  the  instruction  and  load 
queues.  Each  entry  in  the  load  queue  has  a  valid  bit,  which  indicates  whether  the  data  in 
the  entry  is  ready  to  be  used;  the  data  comes  from  a  valid  write  cache  or  data  cache  hit,  or 
has  been  returned  from  the  secondary  memory  system  via  the  MMU  and  BIU.  Instructions 
that  transfer  data  from  the  IPU  register  file  to  the  FPU  register  file  will  set  this  valid  bit 
during the same cycle that the instruction and data are transferred. Load data, on the other
hand, is transferred after the load instruction; the valid bit is set in a later cycle, once the
status of the memory reference is known. Issue of a floating-point load
which  reaches  the  head  of  the  instruction  queue  will  be  stalled  until  the  valid  bit  for  the 
head  entry  of  the  load  queue  is  set.  This  approach  might  be  modified  to  issue  the 
instruction  prior  to  knowing  whether  the  data  is  valid,  but  the  complexity  in  doing  so  is 
significant  and  simulations  suggest  the  performance  benefit  is  small.  Since  the  data  cache 
busses  have  multiple  uses,  a  priority  policy  for  each  bus  needs  to  be  defined.  The 
following  summarizes  the  types  of  transfers  that  can  appear  on  each  bus. 

dcIn

1) 2 floating-point instructions (64b total)

2) 1 double word store data operand (64b). For simplicity, single word loads/stores
will also be sent as 64 bits, with the upper half padded with zeroes. A
pin on the FPU will indicate whether at least one store queue entry has valid
data. Store data must be sent simultaneously to the IPU (in order to write the
proper write-cache entry) and to the data cache. In other words, for a store
transfer to occur, both dcIn and dcOut must be available. Decoupling this into
2 transfers would add complexity, which is probably not warranted, since
stores occur infrequently (9% on average).

3) 1 move-to-FPU instruction (32b) and 1 operand (32b) from the integer register
file. As mentioned above, the valid bit for the corresponding load queue
entry will be set immediately upon transfer of the instruction and data.

4) 1 single/double word load data operand (64b). The majority of loads (80 to
95%) hit in the data cache, and therefore will be sent to the FPU via dcOut.
However, some loads hit in the write cache or prefetch buffer, or need to
come from the secondary memory system. In these cases, the data are transferred
over the dcIn bus. The frequency of these types of transfers is quite
low, on the order of a few percent of all cycles.

dcOut 

1) 1  single/double  word  load  operand  (64b)  on  a  data  cache  hit. 

2)  1  move-from-FPU  operand  (32b)  taken  from  the  floating-point  register  file. 
As  with  a  store,  this  transfer  occurs  via  the  store  queue. 

3)  1  single/double  word  store  operand  (64b). 

Among these bus events, the transfer of load data is given the highest priority, since
an unresolved load in the queue will block other instructions from issuing and can eventually
stall instruction transfer if the queue becomes full. A deadlock situation might result if
instructions were given a higher priority than load data transfers. The only uses for the
dcOut bus are for load data (integer and floating-point) and for store data (floating-point).
As mentioned, stores are infrequent but loads are common. Loads via dcIn occur infrequently,
but when they do occur, there is a good chance that the instruction queue has become
full and has stalled the IPU. Therefore, the following priority policy was defined:


dcIn

1. 1 single/double word load operand
2. 1 or 2 floating-point instructions, or 1 move-to-FPU instruction with data
3. 1 single/double word store operand
4. no transfer

dcOut

1. 1 single/double word load operand
2. 1 move-from-FPU operand, or 1 single/double word store operand
3. no transfer
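
The priority policy can be expressed as a small per-cycle arbiter. The following Python sketch is illustrative only: the function name, its boolean inputs, and the string labels are hypothetical, and move-from-FPU transfers are left out for brevity. It assumes, as stated for store data above, that a store transfer needs both busses in the same cycle.

```python
def arbitrate(load_via_dcin, load_via_dcout, instr_pending, store_pending):
    """Pick one transfer per bus for a single cycle.

    Each argument is a boolean saying whether that kind of request is
    waiting this cycle. Returns a (dcIn, dcOut) pair of labels.
    """
    # Priority 1 on each bus: load data.
    dc_in = "load" if load_via_dcin else None
    dc_out = "load" if load_via_dcout else None

    # Priority 2 on dcIn: floating-point instructions (or a move-to-FPU).
    if dc_in is None and instr_pending:
        dc_in = "instr"

    # Stores come last and need both busses free in the same cycle.
    if store_pending and dc_in is None and dc_out is None:
        dc_in = dc_out = "store"

    return dc_in or "idle", dc_out or "idle"


if __name__ == "__main__":
    # A store waits while an instruction transfer occupies dcIn.
    print(arbitrate(False, False, True, True))
    # With nothing else pending, the store takes both busses.
    print(arbitrate(False, False, False, True))
```

Because a store must win both busses at once, it is the transfer most likely to be delayed, which is why the store queue depth matters for the stall figures reported below.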


The architectural simulator reports a number of interesting statistics about the utilization
of these busses. As summarized in Table 3.23 and Table 3.24, bus activity varies
greatly between the benchmarks; the busses are largely idle for some programs (“alvinn” and
“spice2g6”), whereas they are heavily used by others (“fpppp” and
“nasa7”). Floating-point instructions comprise the great majority of transfers over the dcIn
bus; they rarely stall due to a load-data transfer. This is to be expected, since load data is
transferred via dcIn only on write-cache hits and data cache misses, both of which occur
infrequently.  Conversely,  load-data  transfers  associated  with  data  cache  hits  dominate  the 
activity  on  the  dcOut  bus.  For  many  benchmarks,  store  data  transfers  occur  infrequently 
enough  that  they  seldom  contend  for  the  use  of  either  bus.  An  8-entry  write  cache  was  used 
in  these  simulations  and  even  those  programs  with  a  large  number  of  store  stalls  due  to  bus 
conflicts  experience  very  few  IPU  stalls.  A  more  realistic  4-entry  write  cache  still  has  few 
IPU stalls. Store stalls that result from bus contention impact performance only when the
store queue is not large enough to accommodate store data while it waits for access to the
cache busses.

Table 3.23 Utilization of dcIn Bus

Benchmark   % Idle       % FP Instrs   % Loads      % Stores     % FP Instr Stalls   % FP Store Stalls
            (of total    (of total     (of total    (of total    (of total           (of total
            cycles)      transfers)    transfers)   transfers)   cycles)             cycles)

alvinn      99.2         77.6          0            22.4         0.000               0.000
doduc       54.2         76.6          0.2          23.2         0.013               8.834
ear         48.3         84.7          0            15.3         0.002               35.535
fpppp       9.7          72.7          0.1          27.2         0.008               45.802
hydro2d     50.3         76.7          3            20.3         0.282               16.678
mdljdp2     50           74.7          0.2          25.1         0.022               6.086
mdljsp2     62.5         80.2          0.1          19.7         0.010               3.430
nasa7       10.4         77.9          2.1          20           0.669               26.865
ora         47.9         64.3          0            35.7         0.000               18.056
spice2g6    92.1         88.8          8            3.2          0.002               0.142
su2cor      37.1         67.7          2.6          29.7         0.861               25.869
swm256      17.9         78.4          1.8          19.8         0.517               39.727
tomcatv     12.8         69.8          4.7          25.5         1.272               44.474
wave5       65.8         81.9          0.1          18           0.000               2.285


The Aurora III FPU was designed to use the existing data cache busses to support
the transfer of floating-point instructions and data in order to minimize the number of pins
on the IPU. This approach need not create a performance bottleneck, although it does complicate
the design by requiring the prioritization of bus activity to be merged into existing
load-store unit control logic. This decision requires the interface logic in the FPU to consider
the various uses of the two cache busses and to ensure that instructions and data are
sent to the appropriate registers. An alternative discussed earlier involves the use of dedicated
busses between the IPU and FPU for transferring instructions and data. This would
allow the transfer point for floating-point instructions to be moved ahead by several pipe
stages, but would require adding approximately 192 pins to the IPU. The benefits of dedicated
busses depend on each application and on the degree of synchronization between the
IPU and FPU. Instruction sequences involving floating-point comparison and branch instructions
have been found to have the greatest impact on performance. In order to benefit
from transferring floating-point instructions earlier, both operands must be ready when the
compare instruction reaches the issue point. The 3-cycle latency for a data cache hit in the
Aurora III architecture would require more than 3 instructions to fall between load and
compare instructions in order to take advantage of an early transfer; a shorter cache latency
would favor an earlier transfer point, while dual issue reduces the number of cycles for intervening
instructions, decreasing the impact of early floating-point instruction transfer.

Table 3.24 Utilization of dcOut Bus

Benchmark   % Idle       % Loads      % Stores     % FP Store Stalls   Write Cache       Write Cache
            (of total    (of total    (of total    (of total           Eviction Stalls   Eviction Stalls
            cycles)      transfers)   transfers)   cycles)             (4-entry)         (8-entry)

alvinn      88.2         99.3         0.7          0.000               0.00              0.00
doduc       77.1         81.6         18.4         4.044               2.10              0.00
ear         83.8         93.5         6.5          8.829               0.10              0.00
fpppp       50.8         82.8         17.2         26.785              10.90             0.91
hydro2d     78.2         84.6         15.4         8.361               1.20              0.03
mdljdp2     81           74.4         25.6         2.097               6.40              1.15
mdljsp2     86.6         73.7         26.3         1.196               3.90              0.08
nasa7       47.7         84.2         15.8         17.426              0.00              0.00
ora         80.7         60.4         39.6         7.247               1.80              0.00
spice2g6    87.4         99.5         0.5          0.016               0.00              0.00
su2cor      75.3         69.8         30.2         9.403               12.0              0.91
swm256      67.9         80.4         19.6         14.513              19.60             0.58
tomcatv     53.6         78           22           26.230              7.00              0.00
wave5       86.2         80           20           0.644               0.00              0.00

3.8  Resource  Allocation  Issues 


There are a number of issues that have an impact on the cost of allocating on-chip
resources, including the complexity of algorithms used for the various functional units, the
use of pipelining to support higher clock frequencies while maintaining throughput, and the
selection of a reasonable number of entries for the reorder buffer and queues in the FPU.
RBEs are used to contrast the performance benefit with the cost of these resource alternatives.

3.8.1 Sensitivity to Functional Unit Latencies and Pipelining

In  this  section,  the  area  and  complexity  costs  of  reducing  latencies  will  be  discussed 
for  each  of  the  functional  units,  and  then  the  effect  on  overall  performance  of  this  reduced 
latency  will  be  evaluated. 

Numerous  floating-point  addition  optimizations  can  be  used  to  reduce  latency,  all 
at  the  expense  of  transistor  count  and  implementation  complexity.  These  include  parallel 
paths  for  alignment  and  normalization,  fast  generation  of  the  sticky  bit,  and  leading  one  pre¬ 
diction  for  normalization.  To  evaluate  the  cost/performance  ratio  for  these  features,  several 
adders  were  investigated.  A  two-cycle  add  unit  incorporating  these  approaches  occupied 
the most area, due primarily to its use of two 53-bit mantissa adders. A three-cycle add unit
resulted in an area reduction of 20%. Further reduction of adder resources resulted
in four- to five-cycle latencies.

Conventional  approaches  to  multiplication  involve  a  partial  product  array  (3-2  or  4- 
2  carry-save  adders)  followed  by  a  carry-propagate  mantissa  adder.  Booth  recoding  of  the 
input  operands  can  be  used  to  reduce  the  number  of  levels  in  the  array  by  one,  at  the  expense 
of  adding  recoding  multiplexors.  Analysis  of  these  two  approaches  for  GaAs  DCFL 
showed  that  the  area  savings  of  Booth-recoding  are  small  when  compared  to  the  increase 
in  complexity  of  the  design.  Increases  in  capacitance  along  critical  paths  of  a  Booth  recoded 
multiplier tend to offset the reduction in logic depth. Another alternative involves the iterative
use of a smaller array. This approach reduces the size of the multiplier considerably;
however, five cycles are needed to produce a result. Furthermore, the multiplier is not pipelined,
forcing subsequent multiply instructions to wait for the current instruction to complete.
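
To make the effect of Booth recoding on array size concrete, the sketch below recodes an unsigned multiplier into radix-4 Booth digits. Radix-4 (modified Booth) recoding is an assumption here, since the text does not state the radix, and the function name is illustrative. Each digit lies in the range -2 to 2 and selects one partial product, so a 53-bit mantissa needs 27 partial products rather than 53.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list:
    """Radix-4 Booth digits, least significant first, of an unsigned value."""
    if not 0 <= multiplier < (1 << width):
        raise ValueError("multiplier does not fit in the given width")
    # Zero-extend to an even width whose top bit is 0, so the unsigned
    # operand is read as a non-negative two's complement number.
    padded_width = width + 2 - (width % 2)
    bits = multiplier << 1          # implicit 0 below the least significant bit
    digits = []
    for i in range(0, padded_width, 2):
        low = (bits >> i) & 1        # b[i-1]
        mid = (bits >> (i + 1)) & 1  # b[i]
        high = (bits >> (i + 2)) & 1 # b[i+1]
        digits.append(low + mid - 2 * high)
    return digits


if __name__ == "__main__":
    digits = booth_radix4_digits(15, 4)
    print(digits)                                        # digits for 15
    print(sum(d * 4**i for i, d in enumerate(digits)))   # value they encode
    print(len(booth_radix4_digits((1 << 53) - 1, 53)))   # partial products
```

Halving the number of partial products is what removes a level from the carry-save array; the digit selection logic corresponds to the recoding multiplexors mentioned above.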

Non-restoring division algorithms can be enhanced by representing intermediate results
in a higher radix redundant form. SRT-2, SRT-4, and SRT-8 approaches return one,
two, and three bits per cycle, respectively. Latencies for these divide units vary from 20 to
more than 50 cycles. Techniques can be employed for performing both on-the-fly conversion
from redundant to sign-magnitude form and on-the-fly rounding of the result.
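
A rough lower bound on the number of divide iterations follows directly from the bits retired per cycle. The sketch below assumes a 53-bit double-precision quotient and ignores guard bits, operand setup, conversion, and rounding cycles, all of which a real divider adds; the function name is illustrative.

```python
import math


def srt_iterations(quotient_bits: int, radix: int) -> int:
    """Iterations for an SRT divider retiring log2(radix) bits per cycle."""
    bits_per_cycle = radix.bit_length() - 1   # 2 -> 1, 4 -> 2, 8 -> 3
    if radix < 2 or (1 << bits_per_cycle) != radix:
        raise ValueError("radix must be a power of two")
    return math.ceil(quotient_bits / bits_per_cycle)


if __name__ == "__main__":
    for radix in (2, 4, 8):
        print(f"SRT-{radix}: {srt_iterations(53, radix)} iterations")
```

With a few overhead cycles added to each count, these figures are in line with the range of 20 to more than 50 cycles quoted above.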

The  IEEE  floating-point  standard  specifies  that  conversion  to  and  from  all  formats 
should  be  possible.  For  single,  double,  and  integer  formats,  this  results  in  six  different  types 
of  conversion.  While  conversions  could  also  be  performed  in  the  add  unit,  doing  so  would 
impact  critical  paths  since  the  additional  muxes  and  control  required  would  result  in  an  in¬ 
crease  in  logic  depth.  A  separate  conversion  unit  can  be  designed  with  a  modest  amount  of 
hardware  (30K  transistors),  having  latencies  of  2  to  4  cycles. 

Figure  3.13  shows  CPI  performance  for  various  latencies  of  the  four  execution 
units.  Addition  and  multiplication  both  show  a  17%  change  in  CPI  for  latencies  ranging 
from  one  to  five  cycles.  For  addition,  latencies  of  two,  three,  and  four  correspond  to  inter¬ 
esting  realistic  designs.  An  add  unit  with  a  latency  of  two  results  in  only  a  2%  improvement 
in  performance  over  one  taking  three  cycles,  while  a  latency  of  4  is  5%  worse  than  a  latency 
of  3.  Similarly,  each  additional  cycle  of  latency  in  the  multiplier  reduces  performance  (as 
measured  by  CPI)  by  4%.  In  the  division  unit,  over  a  latency  range  of  10  to  30  cycles,  per¬ 
formance  changes  by  8%.  Conversion  instructions  occur  very  infrequently  and  have  little 
impact on overall performance. The same simulations were repeated for non-pipelined addition
and multiplication units. Interestingly, the degradation in performance is less than
5%. The latches used for pipelining these two units account for approximately 25% of the
area for each unit. Not pipelining these units would result in significant savings in area and
power dissipation.

3.8.2  Reorder  Buffer 


Figure 3.13 Performance vs. Resource Cost for Floating-Point Functional

A reorder buffer provides many benefits. First, if an out-of-order completion policy
is used, the reorder buffer ensures that results are written back to the register file in the correct
order. Upon issue, an instruction reserves the next available location in the reorder
buffer by writing the result register to the entry and by adding the reorder buffer tag to the
correct result-bus shift register entry. When an instruction in an execution unit has finished,
this tag is used to guide the result into the correct reorder buffer entry and the valid bit for
the entry is set. On every clock cycle, if valid data is at the head of the reorder buffer, it is
written back to the register file. In this way, the reorder buffer serves to prevent output dependencies,
since results are written back in the order of the original instruction stream.

Both output and anti-dependencies result from aggressive compiler scheduling of a limited
number of registers (hardware parallelism is limiting available instruction parallelism),
rather than from true dependencies. Register renaming can be used to reduce their impact.
This can be done in the reorder buffer by identifying a result entry with the specific result
register. When a source operand is needed, the reorder buffer and the register file are accessed
in parallel, the former being addressed using the register name. If the result is found
in the reorder buffer, it is forwarded as the correct operand. Multiple result references to the
same register can be active at one time within the reorder buffer; hence it is necessary that
the most recent one be returned. As an alternative to an associative reorder buffer, which is
difficult to implement in GaAs, a small tag lookup table is used to find the most recent reorder
buffer tag for a source register. The valid bit in the entry determines whether the corresponding
data in the reorder buffer is available.

Finally, the reorder buffer can be used to
ensure that exception handling is precise. If an exception occurs during the execution of an
instruction, an exception field in the reorder buffer entry will be written at the same time as
the exceptional result. When this entry reaches the head of the reorder buffer, an exception
is signalled and all remaining entries are marked as invalid. Consequently, no subsequent
instructions have written the register file, and after handling the exception, the FPU can be
restarted at the instruction which follows the exceptional one.
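
The three roles described above (in-order write-back, operand forwarding through a tag lookup table, and precise exceptions) can be sketched in a few dozen lines of Python. This is a behavioral model under stated assumptions, not the Aurora III implementation: the class and method names are hypothetical, the result-bus shift registers are not modeled, and the sketch assumes that an excepting instruction does not itself update the register file.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: int                        # reorder buffer tag assigned at issue
    dest: int                       # result register number
    value: Optional[float] = None
    valid: bool = False             # set when the result arrives
    exception: bool = False


class ReorderBuffer:
    def __init__(self, size: int, num_regs: int = 32):
        self.size = size
        self.entries = deque()      # head of the buffer is entries[0]
        self.regfile = [0.0] * num_regs
        self.latest_tag = {}        # tag lookup table: register -> newest tag
        self.next_tag = 0

    def issue(self, dest: int) -> Optional[int]:
        """Reserve the next entry; returns its tag, or None if full."""
        if len(self.entries) >= self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append(Entry(tag, dest))
        self.latest_tag[dest] = tag
        return tag

    def complete(self, tag: int, value: float, exception: bool = False) -> None:
        """A functional unit returns a result, steered by its tag."""
        for entry in self.entries:
            if entry.tag == tag:
                entry.value, entry.valid, entry.exception = value, True, exception
                return
        raise KeyError(tag)

    def read_operand(self, reg: int):
        """Return (ready, value), preferring the most recent pending result."""
        tag = self.latest_tag.get(reg)
        if tag is not None:
            for entry in self.entries:
                if entry.tag == tag:
                    return entry.valid, entry.value
        return True, self.regfile[reg]

    def write_back(self) -> str:
        """One cycle of in-order write-back from the head of the buffer."""
        if not self.entries or not self.entries[0].valid:
            return "idle"
        head = self.entries.popleft()
        if head.exception:
            # Discard every younger entry so none reaches the register file.
            self.entries.clear()
            self.latest_tag.clear()
            return "exception"
        self.regfile[head.dest] = head.value
        if self.latest_tag.get(head.dest) == head.tag:
            del self.latest_tag[head.dest]
        return "written"
```

Calling `issue` twice for the same destination register and completing the second instruction first shows the key behavior: `write_back` stays idle until the older result arrives, while `read_operand` already forwards the newer value.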

The average number of entries needed in the reorder buffer, as found through simulation,
reflects the average number of floating-point instructions simultaneously in execution.
Another way to view the size of the reorder buffer needed is by comparing how the
average number of entries relates to the average number of floating-point instructions per
basic block, as shown in Table 3.25. Several observations can be made from this table.
First, the average size of a basic block tends to be much larger for floating-point intensive
applications than for integer ones; much more work is being done between branches for
floating-point code. Second, Table 3.25 suggests some correlation between the number of
floating-point instructions per basic block and the optimal size for the floating-point reorder
buffer. This reflects the effect of cache misses, which tend to reduce the number of outstanding
floating-point instructions. As discussed above, the type of issue policy can affect
how many instructions are active at a time and how many reorder buffer entries are needed;
an OOOO policy requires more entries than an IOOO policy (an IOIO policy does not use
a reorder buffer at all). Consequently, an OOOO policy requires roughly the same number
of entries as there are floating-point instructions per basic block. Figure 3.14 compares performance
to resource cost.

Table 3.25 Basic Block Sizes for SPECfp92

Benchmarks       Instructions per   Floating-Point       Reorder Buffer   Reorder Buffer
(Complete Run)   Basic Block        Instructions per     Entries          Entries
                                    Basic Block          (IOOO Policy)    (OOOO Policy)

alvinn           13.97              9.96                 1.17             1.17
doduc            11.33              6.86                 6.29             8.75
ear              9.48               6.00                 1.95             14.35
fpppp            27.92              21.72                8.50             10.22
hydro2d          8.65               4.89                 3.78             7.39
mdljdp2          11.19              6.71                 6.60             11.24
mdljsp2          9.60               4.97                 3.59             9.45
nasa7            28.35              21.74                17.45            17.97
ora              8.80               5.216                7.03             8.14
spice2g6         10.77              1.543                1.12             2.36
su2cor           17.97              12.73                8.27             11.46
swm256           38.97              31.53                7.87             16.17
tomcatv          27.22              24.08                9.20             14.41
wave5            15.64              9.97                 2.69             5.38
Avg Harmonic     13.54              6.51                 3.4              5.7
Avg Arithmetic   17.13              11.99                6.1              9.9
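
The two summary rows of Table 3.25 use different averages, and the gap between them is informative. The short Python sketch below recomputes both for the "Instructions per Basic Block" column; the variable names are illustrative, and the data are copied from the table.

```python
from statistics import fmean, harmonic_mean

# Instructions per basic block from Table 3.25, alvinn through wave5.
instr_per_block = [13.97, 11.33, 9.48, 27.92, 8.65, 11.19, 9.60,
                   28.35, 8.80, 10.77, 17.97, 38.97, 27.22, 15.64]

arithmetic = fmean(instr_per_block)
harmonic = harmonic_mean(instr_per_block)

if __name__ == "__main__":
    print(f"arithmetic = {arithmetic:.2f}, harmonic = {harmonic:.2f}")
```

The harmonic mean is pulled toward the benchmarks with small basic blocks, while the arithmetic mean is dominated by a few programs with very large ones (swm256, nasa7, fpppp, tomcatv). A buffer sized from the harmonic average therefore serves the typical program, but under-provisions the loop-dominated codes.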

3.8.3 Instruction Queue

Queues have been a component of numerous past designs. For instance, in a decoupled
access/execute (DAE) machine the instruction stream is split into separate decoupled memory
and execute streams that communicate via queues [Smith86, Smith87]. This approach offers
several advantages, including an opportunity to issue more than one instruction per cycle and less
sensitivity to memory access delays, since the instruction fetch stream is allowed to run
ahead of the execution stream. Thus, memory accesses can be dispatched further in advance
of when they are actually used. For other designs, an I/O buffer has been a characteristic of
at least one previous FPU [Takeda85], and instruction queues have appeared in several other
machines [Molnar89], [Steiss91], [Darley90].

Figure 3.14 Queue and Reorder Buffer Resource Allocation


The use of an instruction queue instead of a single entry instruction buffer allows
substantially more slip between the IPU and the FPU. Instead of stalling due to FPU data
dependencies or resource conflicts, the IPU is able to transfer a floating-point instruction and
continue execution. The IPU will stall due to the FPU only when the queue becomes full or when it has
to wait for FPU results. Decoupling queues also serve to hide extra latency caused by chip-crossings,
and can lessen the impact of having different clock frequencies for the IPU and
FPU, a situation that can result from there being a difference between the bit-widths of datapaths
for these two chips. Table 3.26 illustrates this point: as the clock frequency of
an IPU chip (paired with a 250MHz FPU) is increased, the CPI does not degrade as fast as
the IPU frequency increases.
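
A short calculation shows the net effect. The sketch assumes that the CPI values in Table 3.26 are counted in IPU clock cycles, so that instruction throughput is proportional to IPU frequency divided by CPI; that reading, and the function name, are assumptions rather than statements from the text.

```python
def net_speedup(base_mhz, base_cpi, new_mhz, new_cpi):
    """Throughput ratio, taking throughput as clock frequency / CPI."""
    return (new_mhz / new_cpi) / (base_mhz / base_cpi)


# Harmonic-mean CPI values from Table 3.26, keyed by IPU frequency in MHz.
harmonic_cpi = {250: 1.189, 300: 1.311, 350: 1.441, 400: 1.578}

if __name__ == "__main__":
    for mhz, cpi in harmonic_cpi.items():
        gain = net_speedup(250, harmonic_cpi[250], mhz, cpi)
        print(f"{mhz} MHz: CPI {cpi:.3f}, relative throughput {gain:.2f}")
```

Under that assumption, raising the IPU clock by 60% (250MHz to 400MHz) raises CPI by about 33%, which still leaves a throughput gain of roughly 20%.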


Table 3.26 CPI for Queues and Different IPU/FPU Clock Frequencies

Benchmark            250MHz        300MHz            350MHz            400MHz
(50M Instructions)   (IPU = FPU)   (IPU = 1.2 FPU)   (IPU = 1.4 FPU)   (IPU = 1.6 FPU)

ear                  1.0049        1.1603            1.3170            1.4941
fpppp                1.0236        1.1172            1.2678            1.4257
hydro2d              1.2866        1.4364            1.5634            1.7056
spice2g6             1.6238        1.6740            1.7001            1.7318
Avg Harmonic         1.189         1.311 (10.3%)     1.441 (21.2%)     1.578 (32.7%)
Avg Arithmetic       1.235         1.347 (9.1%)      1.462 (18.4%)     1.589 (28.7%)

Figure 3.14 shows how performance varies for different queue sizes. For a dual issue
policy, a queue of depth 5 achieves nearly the same performance across all benchmarks
as a queue with an unlimited number of entries. The same experiment was performed for a
single transfer, single issue processor; the results (not shown) indicate an optimal queue
size of 3 entries. Considering that only one instruction can be transferred from the IPU to
the FPU per cycle in such a system, and that there are fewer floating-point instructions active
at a time in such a machine, the smaller optimal queue size is to be expected. Figure
3.15 shows how the recommended size of the instruction queue varies with issue policy.
The smaller number of entries needed in the instruction queue for an OOOO policy can be
explained by the fact that dispatch constraints are less restrictive; as a result, instructions
spend less time on average in the queue before reaching a reservation station. In effect,
look-ahead is being expanded with the larger instruction window. This decrease in queue
entries is more than offset by the need for more reorder buffer entries, which
are more expensive. An upper limit for the number of entries in the instruction queue would
be the number of entries in the integer reorder buffer, since there cannot be more than that
number of active floating-point instructions. The use of an instruction queue does make
precise handling of exceptions more costly in terms of hardware, since the IPU may be
many instructions past an exceptional instruction when the exception is signalled. However,
as will be discussed later, there are several ways to address imprecise floating-point exceptions.

Figure 3.15 Instruction Queue Size

    IOOO Policy:  Avg Harmonic = 4.1,  Avg Arithmetic = 7.0
    OOOO Policy:  Avg Harmonic = 2.4,  Avg Arithmetic = 5.1
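
The decoupling effect of the instruction queue can be illustrated with a toy cycle-level model. The Python sketch below is not the architectural simulator used in this work: it assumes a single-issue IPU, a non-pipelined FPU with a fixed latency, and no data dependencies, and every name in it is hypothetical. It only counts the cycles the IPU needs to get through a trace.

```python
from collections import deque


def ipu_cycles(trace: str, queue_depth: int, fpu_latency: int) -> int:
    """Cycles for the IPU to process `trace` ('I' = integer, 'F' = FP)."""
    queue = deque()
    fpu_busy = 0       # cycles left for the operation inside the FPU
    cycles = 0
    pc = 0
    while pc < len(trace):
        cycles += 1
        # FPU side: advance the current operation, then start the next one.
        if fpu_busy > 0:
            fpu_busy -= 1
        if fpu_busy == 0 and queue:
            queue.popleft()
            fpu_busy = fpu_latency
        # IPU side: an FP instruction needs a free queue entry to transfer.
        if trace[pc] == "F":
            if len(queue) < queue_depth:
                queue.append(pc)
                pc += 1
            # Otherwise the IPU stalls for this cycle.
        else:
            pc += 1
    return cycles


if __name__ == "__main__":
    burst = "FFFF" + "I" * 8   # a burst of FP work followed by integer work
    for depth in (1, 2, 4):
        print(depth, ipu_cycles(burst, depth, fpu_latency=3))
```

With a single entry the IPU spends several cycles stalled behind the busy FPU; with four entries the whole burst is absorbed and the IPU proceeds to the integer instructions without stalling, which is the slip described above.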

3.8.4  Load  Queue 

Use  of  an  instruction  queue  necessitates  a  corresponding  load  queue  in  which  to 
store  data  prior  to  its  being  written  into  the  floating-point  register  file.  However,  the  depth 
of  this  queue  does  not  need  to  be  as  great  as  that  of  the  instruction  queue.  As  shown  in 
Table 3.27, for an IOOO policy, a 3 or 4 entry load queue functions well, while an OOOO
policy needs one fewer entry. As in the instruction queue, out-of-order issue moves more
instructions  past  the  queues  and  into  reservation  stations,  which  also  explains  the  need  for 
fewer  entries  in  the  load  queue.  The  recommended  size  of  the  load  queue  correlates  well 
with  the  fact  that  most  floating-point  operations  use  double  precision  operands,  each  of 
which  requires  2  single-word  load  instructions.  The  need  for  an  additional  entry  can  be  ac¬ 
counted  for  by  memory  system  latency.  Even  on  a  cache  hit,  there  is  one  cycle  delay  be¬ 
tween  arrival  of  the  load  instruction  at  the  queue  and  validation  of  the  data.  Of  course, 
additional  delay  is  incurred  on  a  cache  miss;  the  average  number  of  load  queue  entries  in 
the  table  accounts  for  cache  miss  latencies,  too. 

3.9  Miscellaneous  Issues 

This  section  will  address  a  number  of  unrelated  issues,  including  hardware  support 
for  square  root,  several  alternatives  for  implementing  floating-point  store  instructions,  the 
use  of  additional  result  busses  to  support  an  increase  in  parallelism  for  dual  issue  of  float¬ 
ing-point  instructions,  the  use  of  the  multiply  unit  to  perform  division,  and  the  trade-offs  in 
implementing  different  floating-point  operand  widths. 
3.9.1 Hardware Square Root

Table 3.27 Load Queue Size

Benchmarks           Average Load    Average Load
(50M Instructions)   Queue Entries   Queue Entries
                     (IOOO Policy)   (OOOO Policy)

alvinn               1.00            1.00
doduc                3.07            2.51
ear                  3.60            3.59
fpppp                3.54            4.02
hydro2d              3.26            1.78
mdljdp2              3.17            1.68
mdljsp2              2.02            1.28
nasa7                8.07            8.38
ora                  1.24            1.01
spice2g6             1.71            1.00
su2cor               3.09            2.14
swm256               4.80            4.31
tomcatv              4.36            1.79
wave5                1.88            1.03
Avg Harmonic         2.4             1.7
Avg Arithmetic       3.2             2.5

Since transcendental functions occur infrequently in the SPECfp92 benchmarks,
they are not worth implementing in hardware. The Aurora III FPU omits these operations,
depending instead on the compiler to link in software library routines to perform these calculations.
However, the question of whether to support square root in hardware is not as
clear; Figure 3.16 suggests that some programs would benefit from doing so. Square root
capability with a latency of approximately 21 cycles can be built onto a divide unit; this requires
the addition of both a lookup table ROM for deriving an initial estimate and some
extra datapath logic. The impact on area is small; the impact on critical paths may be slightly
greater. The performance of such square root hardware can be compared to the MIPS library
function for square root, which executes 62 instructions, with a CPI for the routine of
2.09, meaning that 124 cycles are needed for each square root reference. A bound on CPI
that is possible for a hardware square root instruction is described by the following:

\[
\mathrm{CPI} = \frac{\mathrm{Cycles}_{base} + (\mathrm{Number}_{sqrt\ references} \times \mathrm{Latency}_{sqrt}) - (\mathrm{Number}_{sqrt\ references} \times \mathrm{Instructions\ per\ reference} \times \mathrm{CPI}_{routine})}{\mathrm{Instructions}}
\]

Latency_sqrt is the number of cycles added by a square root instruction to the total
runtime. In the worst case, the FPU will stall as soon as the square root instruction is issued,
and the entire latency of the operation will be added to the total cycle count. Intermediate
points in the design space might assume that useful integer work is overlapped with the
square root operation and that the effective latency is either 0 or 10 cycles. Figure 3.17 summarizes
the impact of square root for the 8 benchmarks which use this function. While the
improvement for one application, “ora”, is quite large, the average improvement is a more
modest 7% to 9%. The “ora” benchmark represents a class of programs which perform
graphics transformations and for which square root is used frequently. Since multi-media
applications are increasing, other programs of this type should be examined in order to further
illuminate the cost/benefit of including a hardware square root instruction. The rate of
performance growth for processors suggests that any feature which adds only 5% to overall
performance should require no more than a month for design and verification.
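
The cycle accounting behind this comparison can be sketched as follows. The function below replaces the cost of each software square root reference (124 cycles, as quoted above) with an effective hardware latency of 0, 10, or 21 cycles. The function name and the example program size are hypothetical; they are chosen only to show the arithmetic and are not measurements.

```python
def cycles_with_hw_sqrt(base_cycles, sqrt_refs, effective_latency,
                        software_cycles=124):
    """Total cycles after replacing the software routine with hardware."""
    saved = sqrt_refs * software_cycles      # cycles no longer spent in software
    added = sqrt_refs * effective_latency    # cycles exposed by the hardware unit
    return base_cycles - saved + added


if __name__ == "__main__":
    # Hypothetical program: 1,000,000 cycles with 1,000 square root references.
    base = 1_000_000
    for latency in (21, 10, 0):
        new = cycles_with_hw_sqrt(base, 1_000, latency)
        print(f"effective latency {latency:2d}: {new} cycles, "
              f"speedup {base / new:.3f}")
```

The spread between the 0-cycle and 21-cycle cases is small compared with the saving from removing the software routine, which matches the narrow range of improvements reported for the three hardware options.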


Figure 3.16 Percentage of Instructions Due to Square Root

Figure 3.17 Performance Benefit via Hardware Support for Square Root

    Base:                     SPECfp92 = 303.8
    21-cycle hardware sqrt:   SPECfp92 = 324.4   Change = 6.8%
    10-cycle hardware sqrt:   SPECfp92 = 327.5   Change = 7.8%
    0-cycle hardware sqrt:    SPECfp92 = 330.5   Change = 8.8%

3.9.2  Store  Policies 




Data  transfers  from  the  FPU  to  the  IPU  can  be  handled  in  several  ways.  As  with  oth¬ 
er  floating-point  instructions,  store  and  move-from-FPU  instructions  (both  will  be  referred 
to  as  store  instructions)  must  return  results  in  the  correct  order.  One  simple  approach  would 
be to stall the IPU when a store instruction reaches the head of the instruction queue if the
corresponding  source  register  has  a  write-back  pending.  However,  this  scheme  would  also 
stall  instructions  following  the  store  instruction  which  would  otherwise  be  able  to  issue.  In 
this  case,  two  kinds  of  out-of-order  issue  might  make  sense.  The  store  instruction  might  be 
issued  to  the  reorder  buffer,  where  it  would  await  completion  of  the  instruction  producing 
the  data  to  be  stored.  In-order  completion  would  be  ensured  by  the  reorder  buffer;  when  the 
store  entry  reached  the  head  of  the  reorder  buffer  and  contained  valid  data,  it  would  be  sent 
to  the  IPU  via  a  store  queue.  This  manner  of  issuing  store  instructions  requires  one  addi¬ 
tional  write  port  in  the  reorder  buffer  for  each  result  bus,  since  a  result  might  need  to  be 
written  to  both  its  own  entry  and  to  a  store  entry.  More  importantly,  the  store  transfer  would 
not  take  place  until  all  entries  ahead  of  it  in  the  reorder  buffer  had  been  written  back  to  the 
register  file. 

Another approach would involve the use of a separate store reorder buffer. A tag
would be written to the store reorder buffer at issue and, when the pertinent data was
available, it would be written to both reorder buffers. An additional result shift register
(described under Section 3.9.3) would be needed, but a register lookup table would not be
needed, since the result would be accessed for on-chip use only through the regular reorder
buffer. The most important characteristic of the two-reorder buffer approach is that stores
could be completed ahead of previous non-store entries that have not yet completed. The
peak number of entries needed in such a store reorder buffer would relate to the largest
possible number of active instructions, although the average number of entries would be much
lower. As indicated in Table 3.28, stall-on-issue and issue-to-ROB policies would achieve
similar performance; an issue-to-store ROB architecture would achieve only a 1.4%
improvement in overall CPI compared to a stall-on-issue design (individual applications saw
as much as a 4% improvement in CPI). The small performance gain for a store reorder
buffer does not justify its use. Store instructions are used fairly infrequently and when they are
encountered there is a high probability that the necessary operand has already been, or will
soon be, computed, since the most commonly used floating-point instructions have fairly
short latencies. Table 3.29 shows that little time elapses between a floating-point store
reaching the issue point and data being ready. Support for an appropriate number of write
cache entries would further decouple the execution of floating-point stores. The store
reorder buffer would also face problems handling memory exceptions and preserving a precise
execution model.

Table 3.28  Store Policies

Benchmark             CPI               CPI               CPI
(65M Instructions)    Stall on Issue    Issue to ROB      Issue to Store ROB
alvinn                [illegible]       [illegible]       2.110
doduc                 1.768             1.770             1.754
ear                   1.136             1.091             1.091
hydro2d               1.099             1.093             1.081
mdljdp2               1.079             1.072             1.072
nasa7                 1.065             1.034             1.032
ora                   1.768             2.002             1.770
spice2g6              1.194             1.204             1.204
su2cor                1.683             1.712             1.656
Average               1.345             1.347             1.326
% Change                                0.2               1.4

Table 3.29  Average Delay Between Issue of a Store and Data Availability

Benchmark             Average Store Reorder
(65M Instructions)    Buffer Wait for Data
doduc                 1.621
ear                   1.036
fpppp                 1.357
hydro2d               1.483
nasa7                 1.378
spice2g6              1.079
su2cor                1.468
swm256                1.443
tomcatv               1.492
Avg Harmonic          1.345
Avg Arithmetic        1.373

As a result, the Aurora III FPU was implemented with a stall-on-issue policy for
stores in conjunction with a simple store queue. To ensure precise memory exceptions, this
queue is written with data that has passed through the reorder buffer. The store data
originates from the floating-point register file, reorder buffer, status register, or HI/LO registers
(used for integer multiplication). Each store queue entry contains: 1) the data operand,
aligned to the high or low word, depending on the address of the store, or aligned to the low
word for move-from-FPU instructions, 2) a 2-bit tag which identifies the type of store
(move-from-FPU or store single/double), and 3) the integer reorder buffer tag for the store
instruction, which is used to guide the data to the correct write-cache or integer reorder
buffer entry. A single FPU pin is used to indicate that floating-point store data is available
and ready to be transferred to the IPU. Table 3.30 summarizes the average number of store
queue entries needed by each benchmark; a size of 2 or 3 entries is consistent with the
observation that most floating-point stores are used with double-precision operands.

Table 3.30  Average Number of Store Queue Entries

Benchmark             Avg Store Queue Entries
(50M Instructions)    (IOOO Policy)
alvinn                1.000
doduc                 3.442
ear                   1.108
fpppp                 5.951
hydro2d               3.001
mdljdp2               3.176
mdljsp2               1.873
nasa7                 3.000
ora                   1.933
spice2g6              1.433
su2cor                5.859
swm256                2.950
tomcatv               5.311
wave5                 1.303
Avg Harmonic          2.135
Avg Arithmetic        2.953
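
The three fields of a store queue entry described above can be pictured with a small behavioral model. This is only a sketch, not the Aurora III implementation: the names `StoreKind`, `StoreQueueEntry`, and `StoreQueue`, the tag encodings, and the default capacity of 3 are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum


class StoreKind(IntEnum):
    """2-bit store-type tag; these encodings are assumed, not taken from the design."""
    MOVE_FROM_FPU = 0
    STORE_SINGLE = 1
    STORE_DOUBLE = 2


@dataclass(frozen=True)
class StoreQueueEntry:
    data: int         # operand bits, already aligned to the high or low word
    kind: StoreKind   # identifies the type of store
    int_rob_tag: int  # integer reorder buffer tag used to route the data


class StoreQueue:
    """Fixed-capacity FIFO; a store cannot issue when no entry is free."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self._entries = deque()

    def is_full(self) -> bool:
        return len(self._entries) >= self.capacity

    def data_ready(self) -> bool:
        # Models the single pin signalling that store data is available.
        return bool(self._entries)

    def push(self, entry: StoreQueueEntry) -> bool:
        """Reserve an entry; returns False when the queue is full (issue must stall)."""
        if self.is_full():
            return False
        self._entries.append(entry)
        return True

    def pop(self) -> StoreQueueEntry:
        """Transfer the oldest entry, preserving program order."""
        return self._entries.popleft()
```

Because entries leave strictly in FIFO order, the integer side always receives store data in the order in which the stores were issued.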

3.9.3  Result  Busses 


As the rate of instruction execution within the FPU increases, more demand is placed on
the bus used to write results from an execution unit to the reorder buffer. Consequently,
having additional result busses is beneficial, especially in supporting dual issue of
floating-point instructions. From Table 3.31, we can conclude that no more than two result busses
are needed, which is consistent with results presented elsewhere for designs which issue on
average between one and two instructions per cycle [Johnson91]. Each result bus has a
corresponding result shift register (RSR), each entry of which contains a reorder buffer tag and
a functional unit designator. The output of the functional unit indicated by the designator is
written to the appropriate reorder buffer entry, through the use of the tag. The RSR needs
to have one stage of depth for each cycle of delay in the functional unit with the longest
latency. By constraining divides to use only one of the result busses, one can reduce the
number of entries in the RSR for the other result bus. Alternatively, the Aurora III FPU
allows divide instructions to poll the result busses upon completion of a divide operation; this
means that writeback to the reorder buffer is delayed until a free bus can be acquired. This
approach saves stages for the result bus shift registers, at a slight cost in divide latency.

Table 3.31  CPI Performance for Different Numbers of Result Busses

Benchmark             One Bus     Two Busses    Three Busses    Four Busses
(65M Instructions)    (CPI)       (CPI)         (CPI)           (CPI)
alvinn                2.111       2.111         [illegible]     [illegible]
doduc                 1.704       1.688         1.696           1.696
ear                   1.146       1.127         1.127           1.127
hydro2d               1.052       1.035         1.035           1.035
mdljdp2               1.086       1.019         1.019           1.019
nasa7                 1.279       1.057         [illegible]     [illegible]
ora                   1.775       1.764         [illegible]     [illegible]
spice2g6              1.204       1.204         1.204           1.204
su2cor                1.654       1.627         1.627           1.627
Average               1.367       1.312         [illegible]     [illegible]
% Change                          4.0           4.0             4.0
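
The role of the result shift register can be illustrated with a short behavioral sketch. It assumes one RSR per result bus in which slot k holds the reservation that will drive the bus k+1 cycles from now; the class name `ResultShiftRegister` and its methods are hypothetical and are not taken from the Aurora III design.

```python
class ResultShiftRegister:
    """Behavioral model of one result bus's reservation pipeline (illustrative only)."""

    def __init__(self, depth):
        # One stage per cycle of the longest functional-unit latency served by this bus.
        self.slots = [None] * depth  # each slot: (rob_tag, functional_unit) or None

    def reserve(self, latency, rob_tag, unit):
        """Book the bus `latency` cycles ahead; False signals a structural conflict."""
        if not 1 <= latency <= len(self.slots):
            return False
        if self.slots[latency - 1] is not None:
            return False
        self.slots[latency - 1] = (rob_tag, unit)
        return True

    def tick(self):
        """Advance one cycle and return the reservation that owns the bus this cycle."""
        due = self.slots[0]
        self.slots = self.slots[1:] + [None]
        return due
```

An instruction whose `reserve` call fails at issue must either try another bus or wait a cycle. Letting a long-latency unit such as the divider acquire a bus only at completion, as described above, means the RSR depth no longer has to cover that unit's latency.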

3.9.4  Divide  Performed  in  Multiply  Unit 

There are numerous hardware algorithms for performing division, including
restoring, non-restoring (SRT, Sweeney-Robertson-Tocher), and successive approximation (e.g.,
Newton-Raphson). The Newton-Raphson method, which involves multiplication by a
reciprocal, could be done within the multiply unit, saving the area of a dedicated divide unit.
The effects of doing so were simulated under the following assumptions:

1.  a divide can be issued even if multiplies are currently being executed in the multiply
unit, since each will be using different stages of the multiply pipeline.

2.  a divide or multiply cannot be issued to the multiply unit if a divide is currently being
executed.

Table 3.32  Divide Performed in Multiply Unit

Benchmark
(65M Instructions)              CPI
alvinn                          2.127
doduc                           2.038
ear                             1.623
hydro2d                         1.115
mdljdp2                         1.800
nasa7                           1.294
ora                             1.921
spice2g6                        1.204
su2cor                          1.929
Avg Harmonic                    1.587
Avg Arithmetic                  1.672
% Change from IOOO baseline     -20.7
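
For reference, the Newton-Raphson reciprocal iteration mentioned at the start of this section can be sketched in software as follows. This is a textbook formulation, not the algorithm evaluated in the simulator: the function name, the linear seed constants, and the iteration count are assumptions chosen for illustration, and the sketch ignores IEEE rounding and special values.

```python
import math


def newton_raphson_divide(a: float, b: float, iterations: int = 5) -> float:
    """Approximate a / b using multiplies and subtracts after an initial seed."""
    if b == 0.0:
        raise ZeroDivisionError("division by zero")
    # abs(b) = mantissa * 2**exponent with mantissa in [0.5, 1)
    mantissa, exponent = math.frexp(abs(b))
    # Linear seed for 1/mantissa on [0.5, 1); its error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        # Each step uses two multiplies and roughly doubles the number of correct bits.
        x = x * (2.0 - mantissa * x)
    # Undo the scaling and restore the sign of the divisor.
    reciprocal = math.copysign(math.ldexp(x, -exponent), b)
    return a * reciprocal
```

Every iteration occupies the multiplier for two dependent multiplies, which is why a divide executed this way blocks other multiply-unit traffic for many cycles.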

The results of this experiment, which uses an IOOO policy and a 3-entry instruction
queue, are shown in Table 3.32; the use of a pipelined multiply unit to perform divides has
a significantly negative impact on overall performance. In fact, the gains derived from
using an IOOO policy and an instruction queue have been lost. As a result, a separate divide
unit has been implemented, although this unit has been omitted from the current FPU design due to
space limitations. It would be interesting to determine the effect of not doing division in
hardware at all, but instead in software. While most applications avoid division (less than
1% of all SPECfp92 instructions are floating-point divides), certain programs, such as
graphics applications, would suffer a significant performance loss without a hardware
division instruction.

3.9.5  Operand Formats

To minimize design time of the FPU, single precision calculations were not included
in the add, multiply, or divide units; conversions among all data formats are supported
by the conversion unit. The effect of this decision on overall performance is minimized by
the fact that the ANSI standard for C requires all floating-point calculations to be performed
using the double precision format. Consequently, for the two single-precision SPECfp92
benchmarks that are compiled in C, the compiler converts all single precision operands into
the double precision format prior to performing a floating-point arithmetic operation.
Before a result is stored to memory, it is converted back to the single precision format. All of
these conversions add overhead to the execution of the program. In “alvinn”, conversion
instructions account for 1.0% of all instructions and for “ear”, conversions comprise 16.9%
of all floating-point instructions. There are two options for managing this problem: 1)
change all program variables to the double precision format, or 2) support single precision
operands in hardware. In either case, an upper bound on performance improvement can be
derived from the following relationship:

$$\text{Cycles}_{new} = \text{Cycles}_{old} - \left(\text{Number of conversion instructions} \times \text{CPI}_{conv}\right)$$

This expression assumes that the majority of conversion instructions are used to
convert between the single and double formats, and not to or from the word format; this
assumption is valid for the “alvinn” and “ear” benchmarks. The variable CPI_conv represents
the average number of cycles needed to execute a conversion instruction; several realistic
values are used in Table 3.33. The total cycles for the “alvinn” benchmark is not affected
since conversions comprise such a small percentage of the total instructions that are
executed. On the other hand, for “ear” it is clear that a fairly significant improvement in
performance can be realized by eliminating the need to perform these conversions. As a
reference point, if the performance of a single benchmark improves by 25%, the overall
SPEC rating will improve by 1.3%. The other 4 benchmarks that use single precision
operands (mdljsp2, swm256, tomcatv, wave5) are all compiled in Fortran, which does not
coerce data into a double precision representation. As a result, the CPI values for these
benchmarks already assume that one of the two approaches just described has been
employed in order to avoid generating additional conversion instructions. Conversions are not
significant for any of these Fortran programs, accounting for less than a few percent of all
instructions. Across all 5 single-precision benchmarks, these alternatives for handling
single precision numbers can improve performance by no more than about 5%. It is interesting
to note that the Multi-Titan FPU [Jouppi88] operates only on double precision numbers and
that the RS/6000 internally converts from single to double operands prior to performing an
operation.

Table 3.33  Overhead for Conversions in Single Precision Benchmarks

                 Cycles and % reduction
Benchmark        CPI_conv = 10       CPI_conv = [illegible]5    CPI_conv = [illegible]
alvinn           66.35M (0.00%)      66.35M (0.00%)             66.35M (0.0%)
ear              41.93M (16.5%)      39.68M (20.6%)             37.83M (24.7%)
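
The upper-bound relationship for conversion overhead is simple enough to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the inputs in the usage note are made-up round numbers, not measurements from the simulator.

```python
def cycles_without_conversions(cycles_old, n_conversions, cpi_conv):
    """Upper-bound estimate: remove every conversion at cpi_conv cycles each."""
    return cycles_old - n_conversions * cpi_conv


def percent_reduction(cycles_old, cycles_new):
    """Reduction in total cycles, expressed as a percentage of the original count."""
    return 100.0 * (cycles_old - cycles_new) / cycles_old
```

For example, with a hypothetical 50,000,000 total cycles, 2,000,000 conversion instructions, and 1.5 cycles per conversion, the estimate is 47,000,000 cycles, a 6% reduction. The bound is optimistic because it assumes that no conversion latency is hidden behind other work.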

The IEEE 754 specification also recommends that the extended format be included
for the widest basic format. This leads to the following combinations: single and extended
single, double and extended double, single and double and extended double. However, in
order to follow this recommendation and support the double precision format, it would be
necessary to have a datapath width of more than 80 bits for the extended format. An
extended format might be used to reduce exceptions in sensitive pieces of code. However,
an extended double format is too costly in terms of hardware for the integration
constraints of GaAs, and is probably not justified considering how infrequently it would be
used.

3.10  Simulation  Accuracy  Issues 

Although the Aurora III design implements static branch prediction, the simulator
does not model this feature, instead assuming perfect prediction. The actual performance
of the processor will be lower than what has been presented so far, and an estimate for the
impact of mispredicted branches is presented. The combination of simulation speed and
size of the design space to be explored constrains the runtime of any given experiment. The
effect of simulating only the first 50 million instructions is examined and sampling is
discussed as a means of improving simulation speed and accuracy.

3.10.1  Branch  Prediction 


The Aurora III architectural simulator does not model either a static or dynamic
branch prediction policy. Consequently, all branches are assumed to be predicted correctly
and no effort is made to simulate the effect on CPI of recovery due to misprediction.
However, the measured values for CPI can be derated by using a simple relationship:


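$$\text{CPI}_{new} = \text{CPI}_{old} + \left(\text{CPI}_{old} \cdot \frac{\text{branch instructions}}{\text{instructions}} \cdot \frac{\text{mispredicted branches}}{\text{branch instructions}} \cdot \text{cycles needed to restart}\right)$$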


Table 3.34 summarizes the penalty for several choices of prediction rate and
pipeline depth, assuming an initial CPI of 1.3. The Aurora III design has a pipeline depth such
that restarting issue after a mispredicted branch will require squashing 3 cycles of
instructions that are currently in execution. Branches occur 25% of the time in integer code and
5% of the time in floating-point programs [Yeh93]; this is related to the observation that basic
block size is much larger for floating-point code than for integer code. There are numerous
prediction policies, including: 1) two-level history table, 2) 2-bit history counter, 3) software
profiling (static), 4) 1-bit counter, 5) backward-taken, forward-not-taken (static), 6) always
taken (static). Table 3.34 focuses on the first, second, fourth, and fifth of these policies,
since these represent a reasonable range of possible prediction rates. Clearly, a more deeply
pipelined machine needs to have a better (and possibly more costly) prediction policy.
However, even for the worst policy, floating-point CPI increases by only 4.4% for the
Aurora III FPU. Losses in CPI due to mispredicted branches are offset to some degree by the
positive effects that are not modeled in the simulator, such as support for both double-word
loads/stores and 64 floating-point registers.

Table 3.34  Effect of Branch Prediction on CPI = 1.3 (New CPI and % Change)

Prediction Accuracy              2 Cycle Miss-    3 Cycle Miss-    4 Cycle Miss-    5 Cycle Miss-
                                 Predict Penalty  Predict Penalty  Predict Penalty  Predict Penalty
                                 CPI (% Change)   CPI (% Change)   CPI (% Change)   CPI (% Change)
Integer (25% branches)
  2-Level History Table (0.96)   1.326 (2.0)      1.339 (3.0)      1.352 (4.0)      1.365 (5.0)
  2-Bit Counter (0.89)           1.372 (5.5)      1.407 (8.3)      1.443 (11.0)     1.479 (13.8)
  1-Bit Counter (0.83)           1.411 (8.5)      1.466 (12.8)     [illegible]      1.576 (21.3)
  BTFN (0.61)                    [illegible]      1.680 (29.3)     [illegible]      1.934 (48.8)
FP (5% branches)
  2-Level History Table (0.98)   1.303 (0.2)      1.304 (0.3)      1.305 (0.4)      1.306 (0.5)
  2-Bit Counter (0.96)           1.305 (0.4)      [illegible]      1.310 (0.8)      1.313 (1.0)
  1-Bit Counter (0.94)           [illegible]      1.312 (0.9)      [illegible]      [illegible]
  BTFN (0.71)                    1.338 (2.9)      [illegible]      1.375 (5.8)      [illegible]
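
The derating relationship can be evaluated directly. The following sketch assumes the form in which the misprediction penalty scales with the original CPI, which is the form that reproduces the legible entries of Table 3.34; the function name and parameter names are hypothetical.

```python
def derate_cpi(cpi_old, branch_fraction, prediction_accuracy, restart_cycles):
    """Derate a perfect-prediction CPI for branch misprediction recovery.

    branch_fraction     -- branch instructions / instructions
    prediction_accuracy -- fraction of branches predicted correctly
    restart_cycles      -- cycles needed to restart after a misprediction
    """
    mispredict_rate = 1.0 - prediction_accuracy
    return cpi_old + cpi_old * branch_fraction * mispredict_rate * restart_cycles


def percent_change(cpi_old, cpi_new):
    """Increase in CPI relative to the original value, in percent."""
    return 100.0 * (cpi_new - cpi_old) / cpi_old
```

For integer code (25% branches) with a two-level history table (0.96) and a 2-cycle penalty, `derate_cpi(1.3, 0.25, 0.96, 2)` evaluates to about 1.326, a 2.0% increase. For floating-point code (5% branches) with BTFN (0.71) and the 3-cycle restart of the Aurora III pipeline, the increase is about 4.35%, in line with the 4.4% figure quoted in the text.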

3.10.2  Sampling  and  Simulation  Speed/Accuracy 

With the design trade-offs all dependent on simulation results, one is naturally
interested in the accuracy of trace-driven simulation and whether runtime can be reduced.
The experiments presented depend on results that have been collected from simulating the
first 50 to 100 million instructions for each benchmark. We need to know whether the
beginning portion of a program is representative of the entire runtime, especially considering
the possibility that these first instructions may comprise one-time initialization events.
Short of executing and simulating the entire program, the accuracy of the results can be in
question. Closely tied to this issue is the problem of constraining how long it takes to
perform an experiment. The total runtime can grow rapidly as one considers different points
within a design space across a broad set of benchmarks. Sampling is a technique originally
proposed for cache simulations in order to address these concerns [Laha88, Liu93,
Poursepanj94]. Extending this approach, instead of running continuously for a certain number of
instructions the simulator would start sampling at different times, ending each time after
some number of instructions. This sampled instruction trace might be read from a file,
although this would increase network traffic over taking traces directly from program
execution, which would be problematic if numerous machines are running simulations in
parallel. Alternatively, the trace might still be generated on-the-fly and the simulator turned
on/off at appropriate intervals. The main bottleneck for trace-driven simulation is the speed
of the simulator itself, not how long it takes to execute the benchmark; for example, current
machines are fast enough to run many SPEC benchmarks in only a few tens of seconds. In
order to avoid inaccuracies that might result from cold-start cache effects, hit rates for
various cache structures (instruction, data, TLB) are derived for each benchmark and are applied
to the sampled instruction and data traces. The accuracy of sampling can be correlated to
long instruction runs through the use of certain comparison metrics. These include:

1.  The number of integer and floating-point instructions per basic block. This is a
measure of the distance between branches.

2.  The  frequency  of  different  instruction  types. 

3.  Cycles  per  instruction  (CPI). 

4.  Breakdown  of  stalls  and  average  latencies  for  the  commonly  occurring  instruction 
types. 
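
Two of the comparison metrics listed above, instruction mix and instructions per basic block, can be computed from a trace with a few lines of code. This is a minimal sketch under stated assumptions: the trace is taken to be a sequence of instruction-type labels, the label `INT_BRANCH` is assumed to mark the end of a basic block, and the function names are hypothetical rather than part of the Aurora III simulator.

```python
from collections import Counter


def trace_metrics(trace):
    """Return (instruction mix, average instructions per basic block) for a trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty trace")
    # Fraction of the trace taken up by each instruction type.
    mix = {kind: n / total for kind, n in counts.items()}
    branches = counts.get("INT_BRANCH", 0)
    # Average distance between branches; infinite if the trace has no branch.
    per_block = total / branches if branches else float("inf")
    return mix, per_block


def percent_difference(short_run, long_run):
    """Relative difference of the long-run value from the short-run value, in percent."""
    return 100.0 * (long_run - short_run) / short_run
```

Applied to the “ear” values of 4.48 and 5.92 floating-point instructions per basic block, `percent_difference` gives roughly 32.14%, matching the corresponding entry in Table 3.35.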

Table 3.35, Table 3.36, and Table 3.37 evaluate the error involved in only running
for the first 50 million instructions of 4 benchmarks by comparing these metrics against
longer runs of 1 billion instructions. Three of the four benchmarks experience only very
small differences in CPI and cache hit rates; “spice2g6”, the exception, experiences both
higher cache miss rates and more branch-on-FPU stalls. There is more variation among the
3 floating-point stall sources for the different benchmarks; however, error for floating-point
branch stalls is fairly small for the application (hydro2d) which is most impacted by this
type of stall. For the other benchmarks, these 3 stalls each occur less than 20% of the time
and often do not directly affect CPI because useful integer activity is performed during the
floating-point stall. Instruction frequency changes only slightly for “fpppp” and “hydro2d”,
but changes more for the other 2 benchmarks. The occurrence of load and conversion
instructions increases for the “ear” benchmark on longer simulations, but this does not
translate into a large difference in CPI. The “spice2g6” benchmark demonstrates that there are




Table 3.35  Comparison Metrics (50M / 1G Instrs. and Difference)

Metric                             ear                      fpppp                     hydro2d                  spice2g6
FP Instructions per Basic Block    4.48 / 5.92 (32.14%)     27.47 / 24.67 (-10.19%)   4.80 / 4.82 (0.42%)      0.64 / 0.25 (-60.94%)
Instructions per Basic Block       9.17 / 9.47 (3.27%)      33.56 / 30.82 (-8.16%)    8.50 / 8.53 (0.35%)      5.59 / 7.98 (42.75%)
CPI                                1.005 / 1.011 (0.06%)    1.024 / 1.020 (-0.39%)    1.287 / 1.332 (3.49%)    1.624 / 1.387 (14.59%)
I-cache Hit Rate                   99.90 / 99.93 (0.00%)    78.00 / 78.71 (0.91%)     99.94 / 99.97 (0.003%)   97.57 / 99.28 (1.75%)
D-cache Hit Rate                   99.14 / 99.92 (0.79%)    99.87 / 99.95 (0.08%)     92.47 / 92.49 (0.02%)    90.66 / 94.25 (3.96%)
Iq/Lq Full (% cycles)              9.97 / 14.52 (31.34%)    1.31 / 1.48 (12.98%)      1.90 / 1.85 (-2.63%)     0.00 / 0.00 (0.00%)
bc1x Wait (% cycles)               9.99 / 16.55 (65.67%)    1.43 / 1.74 (21.68%)      31.92 / 34.21 (7.17%)    17.76 / 7.29 (58.95%)
Write Cache Eviction (% cycles)    0.0 / 0.0 (0.00%)        0.01 / 0.01 (0.00%)       0.00 / 0.00 (0.00%)      0.00 / 0.00 (0.00%)

Table 3.36  Functional Unit Latencies (50M / 1G Instrs. and % Difference)

Functional Unit     ear                      fpppp                    hydro2d                  spice2g6
Load Unit           22.20 / 28.40 (27.9%)    18.30 / 21.26 (16.2%)    16.64 / 17.56 (5.5%)     13.38 / 15.74 (17.6%)
Store Unit          22.89 / 23.15 (1.1%)     13.93 / 16.05 (15.2%)    16.50 / 17.31 (4.9%)     8.89 / 12.74 (43.3%)
Add Unit            28.87 / 29.75 (3.0%)     20.13 / 23.35 (16.0%)    19.90 / 20.76 (4.3%)     10.67 / 19.15 (79.5%)
Multiply Unit       28.55 / 29.70 (4.0%)     18.01 / 21.10 (17.2%)    19.21 / 20.01 (4.2%)     14.64 / 19.77 (35.0%)
Divide Unit         24.26 / 26.87 (10.8%)    40.59 / 40.20 (1.0%)     41.04 / 41.90 (2.1%)     23.51 / 27.95 (18.9%)
Conversion Unit     25.12 / 27.68 (10.2%)    8.86 / 8.99 (1.5%)       6.64 / 6.76 (1.8%)       9.41 / 13.37 (42.1%)
Comparisons         8.18 / 9.17 (12.1%)      12.71 / 11.02 (13.3%)    11.03 / 12.04 (9.2%)     14.58 / 17.34 (18.9%)
Miscellaneous       27.92 / 29.06 (4.1%)     8.98 / 7.86 (12.5%)      8.93 / 9.94 (11.3%)      5.99 / 15.20 (153.7%)

Instruction
FP_LOAD
FP_STORE
FP_ADD
FP_MULT
FP_DIV
FP_CONV
FP_COMPARE
FP_ABS
FP_NEG
FP_MOV
FP_[illegible]
INT_LOAD
INT_STORE
INT_ALU
INT_BRANCH
INT_NOP
Avg Difference (for instrs. with a frequency > 5%)


ear 

fpppp 

0.10/0.13 

46.08 

0.41/0.41 

-0.73 

0.01/0.00

-92.31 

0.00/0.00 

-50.00 

0.07/0.08 

20.29 

0.16/0.15 

-5.56 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.04/0.06 

48.65 

0.10/0.10 

-0.96 

0.05/0.07 

46.94 

0.12/0.12 

-2.40 

0.01/0.00 

-92.31 

0.00/0.00 

-50.00 

0.17/0.21 

24.70 

0.00/0.00 

0.10 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.00/0.01 

55.56 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.01/0.02 

42.86 

0.00/0.00 

100.00 

0.15/0.15 

-0.65 

0.42/0.42 

0.48 

0.03/0.00 

-92.31 

0.00/0.00 

133.33 

0.24/0.20 

-13.98 

0.12/0.11 

-6.03 

0.00/0.00 

0.00 

0.00/0.00 

0.00 

0.09/0.07 

-30.11 

0.04/0.05 

23.81 

18.74% 

2.50% 

hydro2d


0.26/0.26 

0.00 


0.00/0.00 

0.00 


0.08/0.08 

0.00 


0.00/0.00 

0.00 


0.02/0.02 

0.00 


0.06/0.06 

0.00 


0.00/0.00 

0.00 


0.00/0.00 

0.00 


0.00/0.00 

0.00 


0.00/0.00 

0.00 


0.00/0.00 

0.00 


0.04/0.04 

0.00 


0.04/0.04 

0.00 


0.26/0.26 

0.39 


0.00/0.00 

-12.50 


0.26/0.26 

-0.39 


0.00/0.00 

0.00 


0.09/0.09 

0.00 


spice2g6


0.08/0.02 

-78.31 


0.00/0.00 

-100.00 


0.00/0.00 

-50.00 


0.00/0.00 

0.00 


0.00/0.00 

0.10 


0.00/0.00 

-100.00 


0.00/0.00 

-100.00 


0.00/0.00 

0.00 


0.00/0.00 

0.40 


0.00/0.00 

0.00 


0.02/0.00 

-71.43 


0.22/0.20 

-8.60 


0.06/0.07 

21.43 


0.25/0.38 

53.23 


0.00/0.00 

0.00 


0.20/0.14 

-27.18

cases where simulating only an initial section of a program may lead to inaccuracies. The
Aurora III simulator does not make use of sampling, but probably should do so in order to
increase both confidence and simulation throughput.

3.11  Evaluation  of  Final  FPU  Designs 

In this section, the wide range of architectural trade-offs presented in this
dissertation is compared via five Aurora III FPU implementations: a 250 MHz baseline
FPU, a design which improves CPI through more complex architectural features, a simpler
design which achieves a higher clock frequency, and two designs which project the
performance gains obtained from process technology improvements.

3.11.1  Turning  Off  Architectural  Features  to 
Increase  Clock  Frequency 

There are many choices to be made in implementing a processor design, and
completely different approaches can often achieve similar performance. This is quite evident in
the two prevalent styles for building microprocessors, one which emphasizes the extraction
of instruction-level parallelism through greater complexity and the other which focuses on
simplifying a design in order to more easily benefit from technology improvements. The
Sun UltraSPARC processor is typical of the former, while the DEC Alpha embodies the latter
design philosophy. Some designs adopt characteristics of both approaches, such as the
MIPS R10000. Similar trade-offs were considered for the Aurora III design; it is interesting
to consider whether a simpler design might have achieved the same level of performance.
Consider the following summary of critical paths (all having a path-length of 20 gates) for
the current FPU design:

1.  write enable for destination register field of ROB => destination register field of ROB => scoreboard
    translation table, containing most recent ROB reference

2.  add unit exception signals => selection of exceptional constant => write exception field of ROB entry

3.  instruction queue head and tail pointers => logic that derives the queue full signal

4.  register field for instruction queue entry => scoreboard read port for ROB tag => read port for valid
    bit of ROB entry

5.  size of data in ROB entry => data dependency logic => issue determination => reserve result bus

6.  size of data in ROB entry => data dependency logic => issue determination => logic to set scoreboard
    busy bit and ROB tag

7.  size of data in ROB entry => data dependency logic => issue determination => write translation table
    which supports precise memory exceptions

8.  size of data in ROB entry (single or double precision) => data dependency logic => issue determination
    => signal to stall issue while writing the status register

9.  ROB valid => data dependency logic => issue determination => reserve ROB entry (update tail
    pointer)

10.  ROB valid => data dependency logic => issue determination => selection of rounding mode

11.  ROB valid => data dependency logic => issue determination => selection of ROB entry to be
    reserved upon entry

12.  ROB valid => data dependency logic => issue determination => reserve store queue entry

13.  ROB valid => data dependency logic => issue determination => advance head pointer of load queue

14.  ROB valid => data dependency logic => issue determination => signal which initiates an arithmetic
    comparison

15.  ROB valid => data dependency logic => issue determination => signal to initiate multiply

16.  integer memory exception => translation table to find floating-point ROB entry corresponding to
    memory reference => signal to clear all ROB valid bits

17.  head entry of ROB generates an exception => logic to clear state throughout FPU and discard
    instructions that follow the exceptional one

18.  counter to time completion of a divide operation => divide unit is free => issue determination =>
    initiate divide

19.  external instruction transfer type => predecode logic => create result class, which indicates whether
    the instruction produces a result

20.  external instruction transfer type => logic for writing an instruction queue entry

These paths are representative, but not comprehensive, of the overall set of critical
paths in the FPU design and many of these are driven by additional inputs which originate
from other sections of the chip. Consequently, there are more parallel paths that derive
these signals and are of critical gate depth. For example, since operands may be found
either in the register file or reorder buffer, the valid bits for both constructs are used to
evaluate whether data dependencies prevent instruction issue; the logic depth is similar for
reading the valid bits from both the scoreboard and reorder buffer. Similarly, there are a
number of constraints besides data dependencies that are used to determine whether issue
is possible and these are not explicitly represented in the above list. Instead, only one path
to each output signal is shown. Also, this list does not include any paths internal to
functional units, of which there are many.

This list suggests that a number of paths are limited by the logic depth of the scoreboard
and reorder buffer. Several designs were evaluated for each of these constructs and
the versions that were chosen are close to optimal for the constraints of GaAs DCFL. For
instance, a read port on the scoreboard which produces a result in 7 levels of logic is
implemented using a stage of multiplexors followed by a tristate gate. Leakage currents in GaAs
constrain  the  amount  of  fanin  that  can  be  tolerated  by  a  tristate  gate,  whereas  in  another 
technology,  such  as  CMOS,  several  levels  could  be  removed  by  constructing  the  read  port 
completely  from  tristates.  In  addition,  the  gate-depth  for  control  logic  tends  to  be  greater 
for  GaAs  since  the  DCFL  family  efficiently  supports  only  NOR-NOR  logic.  For  a  complex 
design  which  supports  most  features  of  the  MIPS  ISA,  the  goal  of  20  gates  per  clock  phase 
is  fairly  aggressive  and  even  the  application  of  significant  additional  human  resources 
could  only  reasonably  reduce  this  by  approximately  2  gate  levels. 

So in what ways can the design be made simpler while still attaining the same level
of performance? An in-order issue and completion policy would remove the need for a
reorder buffer. As mentioned earlier, the reorder buffer provides several benefits; those that
are relevant to this discussion are: 1) retaining the in-order sequence of instructions for an
out-of-order completion policy, and 2) providing support for precise memory exceptions.
The former is not necessary for in-order completion and the latter can be handled by
transferring floating-point instructions which follow a memory reference only when it is
determined that a page fault cannot occur. Most memory references hit in the small first level
TLB that resides in the IPU (see Table 3.38), and will not stall the transfer of floating-point
instructions more than a single cycle. As a result, this requirement should have only a
modest effect on performance, especially in light of the high integer content of many
floating-point programs. For a technology with higher integration levels, MMU functionality would
be contained on the same chip as the IPU and all memory access exceptions could be
resolved in a single cycle. It is also interesting to note that removing the current reorder buffer
would reduce chip area by approximately 20%. The reorder buffer is not as area-efficient
as the register file, which is based on a 6-transistor memory cell and sense-amplifier read
ports; instead, multiplexors and tristates are used in the reorder buffer for write and read
access. The reorder buffer contains 8 entries, each with 89 bits, for a total of 712 bits of state,
versus 2048 bits for the register file; still, the reorder buffer is 85% larger than the register
file. A future version of the FPU should base most or all of the reorder buffer design on the
denser register file style.

Table 3.38  Percentage of Memory References that Miss in the IPU Mini-TLB

Benchmark         % Memory References

alvinn            0.191
doduc             0.403
ear               0.195
fpppp             0.081
hydro2d           4.813
mdljdp2           0.664
mdljsp2           0.546
nasa7             1.034
ora               0.002
spice2g6          8.425
su2cor            5.569
swm256            3.175
tomcatv           6.827
wave5             0.271

Arithmetic Avg    2.3
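The storage figures quoted for the reorder buffer and register file can be checked with a line of arithmetic. The per-bit area ratio in the second line is not stated in the text; it is only an inference, assuming that "85% larger" means the reorder buffer occupies about 1.85 times the area of the register file.

```latex
\begin{align*}
N_{\mathrm{ROB}} &= 8 \times 89 = 712 \text{ bits}, \qquad N_{\mathrm{RF}} = 2048 \text{ bits} \\
\frac{A_{\mathrm{ROB}} / N_{\mathrm{ROB}}}{A_{\mathrm{RF}} / N_{\mathrm{RF}}}
  &\approx 1.85 \times \frac{2048}{712} \approx 5.3
\end{align*}
```

Under that assumption, each bit of reorder buffer state costs roughly five times the area of a register file bit, which is consistent with the recommendation to adopt the denser register file style.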

A scoreboard may still be the most efficient way to resolve data dependencies, but
issue logic would be simpler due in part to the fact that dual issue would not be possible; a
simple in-order completion policy could not simultaneously issue two instructions to
functional units that have different latencies. The omission of a reorder buffer would also
simplify the issue logic needed to detect data dependencies, since operands could only
originate from the register file. In fact, there are very few constraints that could prevent
issue, and all are fairly easy to resolve:

1.  an  instruction  is  being  executed  in  a  different  functional  unit  than  that  needed  by  the 
instruction  to  be  issued, 

2.  the  current  instruction  depends  on  the  result  of  an  outstanding  instruction, 

3.  the  current  instruction  uses  a  non-pipelined  functional  unit  which  is  busy  executing 
a  preceding  instruction, 

4.  for  a  load  instruction,  the  head  entry  of  the  load  queue  does  not  contain  valid  data, 

5.  for  a  store  instruction,  there  are  no  free  entries  in  the  store  queue. 
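
The five constraints above amount to a short priority check at issue time. The following is a minimal sketch of that check, not the issue logic of the actual design; the class names, field names, unit names, and the default queue size are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Instr:
    unit: str                                   # hypothetical unit names: "add", "mul", "div", "load", "store"
    sources: Set[int] = field(default_factory=set)   # source register numbers


@dataclass
class FpuState:
    active_unit: Optional[str] = None           # unit currently executing, if any
    pending_dests: Set[int] = field(default_factory=set)      # registers awaiting results
    busy_nonpipelined: Set[str] = field(default_factory=set)  # busy iterative units
    load_head_valid: bool = True                # head of load queue holds valid data
    store_free_entries: int = 2                 # assumed store queue capacity


def stall_reason(instr: Instr, fpu: FpuState) -> Optional[str]:
    """Return the first constraint that blocks issue, or None if issue may proceed."""
    if fpu.active_unit is not None and fpu.active_unit != instr.unit:
        return "different functional unit active"        # constraint 1
    if instr.sources & fpu.pending_dests:
        return "data dependency"                          # constraint 2
    if instr.unit in fpu.busy_nonpipelined:
        return "non-pipelined unit busy"                  # constraint 3
    if instr.unit == "load" and not fpu.load_head_valid:
        return "load queue head not valid"                # constraint 4
    if instr.unit == "store" and fpu.store_free_entries == 0:
        return "store queue full"                         # constraint 5
    return None
```

Each test is a simple comparison or set intersection, which is the sense in which these constraints are "fairly easy to resolve" compared with the reorder-buffer-based issue logic.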

Together,  these  changes  would  allow  a  new  logic  depth  target  of  15  gates  per  clock 
phase,  which  corresponds  to  an  overall  reduction  in  path  length  of  25%.  The  register  file 
access  time  would  also  need  to  improve,  from  the  current  1.8ns  to  something  closer  to 
1.4ns.  This  may  well  be  attainable  since  all  read/write  decoding  is  currently  performed  in 
the  same  clock  phase  as  the  access  and  it  could  be  moved  to  the  phase  that  precedes  the 
access,  removing  400ps  to  500ps  from  these  critical  paths.  This  would  need  to  be  explored 
further. 
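
The numbers in this paragraph follow from simple arithmetic, worked here only as a check on the quoted figures. The symbol names are introduced for illustration.

```latex
\begin{align*}
\text{path length reduction} &= \frac{20 - 15}{20} = 25\% \\
t_{\mathrm{RF}} &\approx 1.8\,\text{ns} - (0.4 \text{ to } 0.5)\,\text{ns} = 1.3 \text{ to } 1.4\,\text{ns}
\end{align*}
```

Moving the read/write decode into the preceding phase would therefore be just enough, on paper, to reach the 1.4ns access time that the shorter phase requires.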

Critical paths within functional units constitute another issue. It is often difficult to
arbitrarily add pipeline latches to a design to increase the clock frequency. So even if path
depth for other parts of the design is reduced by a quarter, it may be difficult to do so within
the functional units. For example, consider the 53-bit mantissa adder which contributes to
at least one critical phase for each of the functional units. This adder is already pipelined so
that the first 3 to 4 gates of the carry and sum logic are generated and then latched. Adding
latches elsewhere in the adder can significantly increase area because the logic fans out
before converging back to produce a single 53-bit result. Other constructs, such as leading one
prediction and sticky-bit logic, can grow substantially in size due to deeper pipelining and
more complicated routing. As the area for these functional units increases, all paths are
impacted since global routing capacitance increases. As discussed earlier, not pipelining the
functional units has only a modest effect on overall performance. Doing this and adding an
extra cycle of latency to each unit would allow a higher clock frequency. The analysis
which follows will examine these approaches, assuming that deeper pipelining will degrade
cycle time by 10%.

The last 2 columns of Table 3.39 compare the two simpler designs which achieve a
higher clock frequency to the baseline FPU implementation (column 1). The design which
supports pipelined functional units has 19.8% lower performance and the non-pipelined
design has 12.2% lower performance. However, the simpler designs would have been easier
to implement, resulting in a shorter design cycle for the FPU. The simpler designs could
benefit more quickly from process technology improvements since there are fewer critical
paths to be reevaluated.

3.11.2  SPECfp92  Comparisons 

This section summarizes the current state of microprocessor performance, via
SPEC ratings (refer to Table 3.39, Table 3.40, and Table 3.41). In addition, 5 versions of
the Aurora III FPU are listed, including the 2 discussed in Section 3.11.1. In addition to the
250MHz baseline design, the other 2 versions are projections of the baseline in light of
reasonable technology improvements. The process technology that supports the FPU design
has not changed in over 2 years but a newer version should be available soon and it is
interesting to speculate on the impact that this will have on the floating-point performance of
the Aurora III design. These improvements include both finer interconnect pitches and
faster intrinsic gate switching speeds, which together should decrease the average loaded gate
delay by 10% to 40%. It is assumed that register file access time will also decrease,
although the amount may not directly track gate switching speeds.

Table 3.39  Aurora III SPECfp92 Comparison

Benchmark           Aurora III   Aurora III   Aurora III   Aurora III   Aurora III
(50M Instructions)  IOOO         IOOO         IOOO         IOIO         IOIO
                    base         base         base         pipelined    non-pipelined
                    250MHz       300MHz       350MHz       280MHz       310MHz

alvinn              [illegible]  [illegible]  308.8        209.1        231.5
doduc               199.6        239.6        279.5        167.6        184.0
ear                 371.7        446.0        520.3        279.6        296.9
fpppp               294.0        352.8        411.6        191.1        210.6
hydro2d             253.1        303.8        354.4        232.9        257.4
mdljdp2             288.4        346.0        403.7        266.1        289.0
mdljsp2             142.6        171.1        199.6        130.9        141.6
nasa7               453.6        544.4        635.1        270.2        299.3
ora                 191.6        229.9        268.3        157.4        172.0
spice2g6            138.6        166.3        194.0        141.2        156.4
su2cor              426.5        511.8        597.1        336.0        366.8
swm256              272.4        326.9        381.3        162.5        176.3
tomcatv             369.3        443.2        517.1        264.2        290.7
wave5               171.8        206.1        240.5        153.8        169.2

SPECfp92            253.2        303.8        354.4        203.1        222.3
SPECint92           na           na           na           na           na

% Change from       -            20.0         40.0         -19.8        -12.2
250MHz baseline
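
The SPECfp92 rating is the geometric mean of the individual benchmark ratios, so the summary rows of Table 3.39 can be cross-checked from the per-benchmark entries. The sketch below does this for the 280MHz pipelined column, whose 14 entries are all legible; the function and variable names are illustrative only.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of a list of positive SPEC ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Per-benchmark ratings for the 280MHz pipelined in-order design (Table 3.39),
# in the table's order from alvinn through wave5.
pipelined_280mhz = [209.1, 167.6, 279.6, 191.1, 232.9, 266.1, 130.9,
                    270.2, 157.4, 141.2, 336.0, 162.5, 264.2, 153.8]

baseline_specfp92 = 253.2          # 250MHz baseline rating from the same table

specfp92 = geometric_mean(pipelined_280mhz)
percent_change = (specfp92 / baseline_specfp92 - 1.0) * 100.0

print(f"SPECfp92 ~ {specfp92:.1f}, change vs. baseline ~ {percent_change:.1f}%")
```

The result should land close to the tabulated 203.1 rating and the -19.8% change, with small differences attributable to rounding of the individual entries.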

Further, there are a number of features of the FPU that are not accounted for in these
simulation-based results, some of which should enhance overall performance:

1)  Better code reordering that reflects the unique characteristics of the Aurora III
design; the compiler used for these experiments was tailored to the MIPS R2000/
3000 scalar architecture. This point has been discussed briefly in the context of
floating-point compare sequences; efforts elsewhere have shown performance gains of
20% to 30% [Johnson91] for code that has been compiled for a specific processor
implementation.

2)  The benefit of having twice as many floating-point registers as the R3000/R4000
microprocessors. This should somewhat alleviate the long 3-cycle primary cache
latency by providing more local storage for intermediate results.

3)  Branch prediction is not modeled, but as discussed in Section 3.10.1, it should have
only a small negative impact on floating-point applications, since basic block size is
quite large.

4)  The benefit of supporting double-word loads and stores was estimated in
Section 3.7.1; it remains to be verified.

5)  Operating system calls are not modeled by a PDQE-based approach to architectural
simulation. For the SPEC benchmarks, the OS is entered on average only 1.5% to
2% of the time; this is less than what is realistic for current multi-media applications.

6)  Hardware support for square root can result in a 7% to 9% improvement in
performance, but has not been implemented in the current design.

Table 3.40  SPEC Ratings for Current Microprocessors

Benchmark       SGI       SGI       DEC       DEC       IBM       IBM       Sun       Sun
(50M            R8000     R10000    21064     21164     RS/6000   Power 2   SuperSp2  UltraSp
Instructions)   (TFP)     200MHz*   200MHz    300MHz*   580       71.5MHz   90MHz     167MHz*
                75MHz                                   62.5MHz

alvinn          793.6     na        436.9     na        206.2     na        na        na
doduc           157.2     na        131.0     na        88.6      na        na        na
ear             596.6     na        587.6     na        174.2     na        na        na
fpppp           279.8     na        193.3     na        172.6     na        na        na
hydro2d         484.6     na        216.8     na        126.7     na        na        na
mdljdp2         290.5     na        153.1     na        124.2     na        na        na
mdljsp2         116.1     na        75.1      na        57.3      na        na        na
nasa7           608.3     na        280.5     na        203.9     na        na        na
ora             236.2     na        156.2     na        103.1     na        na        na
spice2g6        85.2      na        100.2     na        73.7      na        na        na
su2cor          515.4     na        291.9     na        208.1     na        na        na
swm256          361.8     na        226.8     na        95.8      na        na        na
tomcatv         672.6     na        304.6     na        210.3     na        na        na
wave5           180.6     na        115.3     na        69.2      na        na        na

SPECfp92        310.6     600*      200.1     500       124.8     274       147       305*
SPECint92       108.7     300*      132.7     330       59.2      134       135       275*

Designs marked with a * have been announced but are not yet commercially available.

Table 3.41  SPEC Ratings for Current Microprocessors, continued

Benchmark       Intel        Intel P6  IBM       PowerPC   PA-RISC   PA-RISC   AMD K5
                Pentium      133MHz*   Power 2   620       HP755     8000      100MHz
                815                    71.5MHz   130MHz    99MHz     200MHz*
                100MHz

alvinn          170.5        na        na        na        176.8     na        na
doduc           79.1         na        na        na        142.0     na        na
ear             210.7        na        na        na        258.4     na        na
fpppp           [illegible]  na        na        na        237.1     na        na
hydro2d         83.0         na        na        na        166.1     na        na
mdljdp2         95.2         na        na        na        192.1     na        na
mdljsp2         44.9         na        na        na        92.3      na        na
nasa7           60.7         na        na        na        123.3     na        na
ora             93.7         na        na        na        276.9     na        na
spice2g6        64.9         na        na        na        91.9      na        na
su2cor          56.9         na        na        na        177.2     na        na
swm256          45.1         na        na        na        79.3      na        na
tomcatv         77.7         na        na        na        138.0     na        na
wave5           55.8         na        na        na        112.1     na        na

SPECfp92        80.6         >200*     274       300       150.6     >550*     105*
SPECint92       100.0        >200*     134       225       80.0      >360*     130*

Designs marked with a * have been announced but are not yet commercially available.

3.11.3  Final design

Appendix A contains a layout plot of the final FPU design, which can be summarized
as follows:

-  250  MHz,  300  SPECfp92  rating 

-  500K  transistors 

-  16x16 mm^2

-  40 instructions (MIPS R4000), including double-word loads/stores and integer multiply

-  Iterative  5  cycle  Wallace  tree  multiplier  (4-2  adders) 

-  Pipelined  3  cycle  add  unit 

-  Pipelined  2  cycle  conversion  unit 

-  Iterative  19  cycle  SRT-8  divide  unit  (not  included  with  this  version) 

-  IEEE-754  compliant  (4  rounding  modes  and  exceptions) 

-  Precise  and  higher  performance  real-time  exception  modes 

-  Issue  policy:  in-order  issue,  out-of-order  completion,  2  instructions  per  cycle 

-  Data  prefetching  with  unity  stride 

-  Instruction  queue:  6  entries,  predecoded 

-  Load  queue:  2  entries 

-  Reorder  buffer:  8  entries 

-  Store  queue:  2  entries 

-  Result  busses:  2 

-  5 students - 2 years

Verification via random testing:

-  Functional units verified against actual workstation (>5M operations per unit)

-  FPU verified using self-checking random instruction sequences (>200M instructions)

Design-for-test  includes: 

-  5  scan  chains  (4  data,  1  control) 

-  Full  read/write  access  to  register  file 

-  Individual  instructions  are  testable  via  scan  paths 

-  Speed  testing  requires  only  the  2  clock  signals  to  operate  at  high  frequency 


CHAPTER  4 


Implementation of a High Performance Floating-Point Unit

The culmination of this work has been the design of an IEEE-754 compliant double
precision floating-point unit as part of the Aurora III processor. The chip was designed in
a 1.0µm (0.6µm effective gate length) GaAs direct coupled FET logic process. The
discussion will focus first on trade-offs for the various functional units that are appropriate for the
circuit and integration constraints of GaAs. These constraints include low fanin, greater
logic depth due to the use of only NOR-NOR logic, the absence of dynamic logic
structures, a limited use of pass-gate logic, and lower layout density that results from a ratioed
logic family. Much of the motivation for the functional unit designs originated with work
done elsewhere, but has been extended in order to accommodate differences that result
from using a high-performance GaAs technology. Also, a number of corrections to the
original references were discovered during the verification of these units. Further, the
conversion unit that is described is an original design that can execute any of the 6 conversion
operations called for in the IEEE-754 specification with a latency of 2 cycles. The second
part of this chapter will focus on a set of general issues that arise while designing a
floating-point unit, including floating-point loads and stores, the use of predecoding to reduce
critical path depth, and reasonable design-for-test features.

4.1  Add  Unit 

The IEEE-754 compliant double precision floating-point addition unit supports the
four rounding modes specified by the IEEE standard: round to nearest, round to +∞, round
to -∞, and round to zero. The add unit design is fully pipelined, with a latency of three 4ns
clock cycles, and consists of 50,000 transistors.

4.1.1  Adder  Implementation 

As discussed in the section on circuit issues, GaAs DCFL presents several challenges
compared to CMOS design, including low noise margins, the use of NOR-only logic,
and low fanin. In light of these technology issues, a number of different adder
implementations were evaluated for the mantissa addition, including carry skip, 4-bit carry
look-ahead, and Ling-modified carry-select designs.

The carry-skip adder was found not to be appropriate for several reasons [Turrini89],
[Majerski67], [Lehman61]. The small fanin of GaAs increases the number of gate levels needed
for the skip logic. This is especially true for implementing multiple levels of skip logic,
which is necessary to achieve the lowest critical path lengths. Intrinsic gate speeds in GaAs
are fast (70 ps for an unloaded inverter), but interconnect and fanout increase the average
delay to about 100 ps per level; an adder such as the carry-skip design with greater than 15
levels of logic is unacceptable on a chip with a target of 20 gates per clock phase. The
NOR-only limitation is also problematic, since there are instances where the lack of a single-level
AND gate leads to additional levels of logic. The carry-skip adder requires that nodes along
the carry chain be reset to a logic low value prior to performing an addition. In other
technologies, this can be handled efficiently by pre-discharging these nodes. In GaAs this is not
feasible because of the high leakage current and the diode gate of the MESFET transistor.
Consequently, a mux-based approach is necessary, leading to an increase in the levels of
logic. Finally, delay through a carry-skip adder is proportional to the square root of the
number of bits, versus a log relationship for a look-ahead adder. As the number of bits
increases, there is a cross-over point, beyond which a look-ahead approach is faster; the
53-bit mantissa adder used throughout the FPU is more efficiently implemented using a
look-ahead approach. The FPU needs several sizes of adders and it was desirable to choose a
single adder design that could be extended to support a variety of bit widths.
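
The gate-count budget in this paragraph follows from simple arithmetic. The sketch below assumes that each 4ns clock cycle is split into two equal phases; that split is an assumption made here for illustration and is not stated in this paragraph.

```latex
\begin{align*}
t_{\text{phase}} &\approx 20 \text{ gates} \times 100\,\text{ps/gate} = 2\,\text{ns} \\
t_{\text{cycle}} &= 2 \times t_{\text{phase}} = 4\,\text{ns} \quad (250\,\text{MHz}) \\
t_{\text{carry-skip}} &> 15 \times 100\,\text{ps} = 1.5\,\text{ns} \approx 75\% \text{ of one phase}
\end{align*}
```

On these numbers a carry-skip adder alone would consume about three quarters of a phase, leaving little room for the logic that surrounds the adder in each functional unit.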

Figure 4.1  53-bit Ling Adder

The Ling-modified adder is shown in Figure 4.1. The 53 bit wide version in this
figure is divided into six blocks, each with a width of 9 bits. Each block is further divided into
3-bit groups, with each block having 3 groups. The adder uses a carry-select algorithm
within each group and each block. A ripple sum is created for 3-bit groups for both a
carry-in of 1 and a carry-in of 0. These pairs of three bit sums are then fed into a multiplexor to
form the 9-bit block sums. The block carry generate and propagate signals are used to form
the 9-bit block sums. The six block sums are then fed into a second level of multiplexors to
create the final sum. The global generate and propagate signals are then used to select the
appropriate 9-bit block for the final sum. The group-generate equation is:

    G_5 = g_{17} + g_{16} p_{17} + g_{15} p_{16} p_{17}

The Ling technique [Ling81] uses g_i = g_i p_i to simplify the generate equation:

    G_5 = p_{17} (g_{17} + g_{16} + g_{15} p_{16})

    G_5 = p_{17} c^*_5

Expanding the c^*_5 term gives:

    c^*_5 = a_{17} b_{17} + a_{16} b_{16} + a_{15} b_{15} a_{16} + a_{15} b_{15} b_{16}

The primary advantage of this adder comes from expressing the local group carry
generation in terms of c^*, a signal generated from 4 terms of 10 inputs. Traditional carry
look-ahead adders produce group generate signals in 7 terms of 24 inputs. Fanin limitations
dictate an optimum group size of 3 bits.
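
The Ling simplification can be checked exhaustively for one 3-bit group. The sketch below assumes the usual definitions g_i = a_i AND b_i and p_i = a_i OR b_i (the inclusive-OR propagate is what makes g_i = g_i p_i hold) and compares the conventional group generate with p_17 AND c*, where c* is the four-term, ten-input expression. The function names are illustrative.

```python
from itertools import product


def group_generate(a, b):
    """Conventional group generate for bits 15..17 (index 0 is bit 15)."""
    g = [x & y for x, y in zip(a, b)]      # bit generate: a AND b
    p = [x | y for x, y in zip(a, b)]      # bit propagate: a OR b
    return g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])


def ling_pseudo_carry(a, b):
    """Ling term c*: 4 product terms using 10 literals in total."""
    a15, a16, a17 = a
    b15, b16, b17 = b
    return (a17 & b17) | (a16 & b16) | (a15 & b15 & a16) | (a15 & b15 & b16)


def ling_identity_holds():
    """Check G = p17 AND c* for all 64 combinations of the six input bits."""
    for bits in product((0, 1), repeat=6):
        a, b = bits[:3], bits[3:]
        p17 = a[2] | b[2]
        if group_generate(a, b) != (p17 & ling_pseudo_carry(a, b)):
            return False
    return True


print(ling_identity_holds())
```

The check confirms that the final AND with p_17 can be deferred, which is what allows the shallower c* signal to be used in the carry tree under a tight fanin limit.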

The analysis of critical paths for the Ling-modified adder and a group-4 carry
look-ahead adder was based on [Stritter90]. A “direct search” algorithm was used to minimize
functions of numerous variables. In the case of the CLA-4 adder, there are 14 independent
variables, one for each of the enhancement transistor widths that lie along the top critical
path. The algorithm focuses on one variable at a time and evaluates f(x, xi), f(x, xi - delta *
ei), and f(x, xi + delta * ei), where x represents all variables not currently being evaluated,
xi is the current variable being tested, delta is the step size, and ei is a unit-vector for the
variable xi. One of these three expressions will result in the smallest value for the function
and this is an indication of a direction that warrants further exploration. Constraints are
enforced by adding penalty values to the primary variable when constraints are violated. In
the CLA-4 optimization, for example, we calculated the enhancement width for each level
to derive the smallest delay for a given power dissipation. Thus, power was a constraint;
whenever the calculated widths generated too much power for the path, the overall
delay was penalized. This behavior forces the algorithm to move back to transistor widths
which do not violate the power budget.
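
A minimal sketch of such a coordinate-wise direct search with a penalty term is given below. It is not the optimization program used in this work: the three-stage delay model, the power model (sum of widths), the penalty weight, and the step-halving rule are all assumptions chosen to keep the example small.

```python
from typing import Callable, List, Sequence, Tuple


def direct_search(f: Callable[[Sequence[float]], float], x0: Sequence[float],
                  delta: float = 0.5, min_delta: float = 1e-3,
                  max_sweeps: int = 1000) -> Tuple[List[float], float]:
    """Vary one coordinate at a time, keeping the best of x_i, x_i - delta, x_i + delta."""
    x = list(x0)
    best = f(x)
    for _ in range(max_sweeps):
        if delta < min_delta:
            break
        improved = False
        for i in range(len(x)):
            candidates = [(best, x)]
            for step in (-delta, delta):
                trial = list(x)
                trial[i] += step
                candidates.append((f(trial), trial))
            value, point = min(candidates, key=lambda c: c[0])
            if value < best:
                x, best, improved = point, value, True
        if not improved:
            delta /= 2.0                      # assumed refinement rule
    return x, best


def chain_delay(widths: Sequence[float], final_load: float = 8.0) -> float:
    """Toy delay model: each stage's delay is its load divided by its drive width."""
    loads = list(widths[1:]) + [final_load]
    return sum(load / w for load, w in zip(loads, widths))


def penalized_delay(widths: Sequence[float], power_budget: float = 6.0,
                    penalty: float = 100.0) -> float:
    """Delay plus a penalty whenever the (toy) power, sum of widths, exceeds the budget."""
    if min(widths) <= 0.0:
        return float("inf")                   # widths must stay positive
    excess = max(0.0, sum(widths) - power_budget)
    return chain_delay(widths) + penalty * excess


best_widths, best_delay = direct_search(penalized_delay, [1.0, 1.0, 1.0])
print(best_widths, best_delay)
```

Because any excursion past the budget is charged far more than the delay it saves, the search settles on widths that use the available power without exceeding it, which mirrors the behavior described above.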

Delay  calculation  is  performed  using  two-dimensional  macro-models  [Kayssi93a]. 
The  macro-model  derivation  results  in  expressions  for  delay  and  output  slew  rate  for  each 
type  of  gate  as  a  function  of  fanin,  fanout,  interconnect  capacitance  and  input  slew  rate.  The 
form of the expressions is:

Rising output:

    delta = T_{in} \times (c_0 + c_1 \times y)

    T_{out} = [illegible]

Falling output:

    delta = T_{in} \times (c_0 + c_1 \times y + c_2 \times \sqrt{y})

    T_{out} = T_{in} \times (c_3 + c_4 \times y + c_5 \times \sqrt{y})

Where:

    y = C_{total} / (W_E \times T_{in})

    C_{total} = C_{int} + k_1 \times W_L + k_2 \times W_E

    W_E = enhancement width of driving gate
    W_L = sum of enhancement width loads
    C_{int} = interconnect capacitance
    c_0 through c_5 are fitting coefficients

There  are  different  coefficients  for  each  type  of  gate  being  modeled  (i.e.,  fanin).  In¬ 
terconnect  is  represented  as  a  lumped  capacitance;  this  approach  is  quite  accurate  for  the 
shorter  lengths  of  interconnect  seen  within  a  combinational  block.  The  coefficients  are  ob¬ 
tained  by  running  a  fitting  algorithm  on  the  results  of  numerous  HSPICE  simulations.  An 
empirical  relationship  for  power  as  a  function  of  enhancement  gate  width  was  also  obtained 
using  HSPICE. 
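
As a rough illustration of how such a macro-model would be evaluated, the sketch below computes the rising-output delay, T_in x (c0 + c1 x y), with y = C_total / (W_E x T_in) and C_total = C_int + k1 x W_L + k2 x W_E. All numeric values are placeholders chosen for the example, not fitted coefficients from this work, and the function and parameter names are hypothetical.

```python
def total_capacitance(c_int: float, w_load: float, w_drive: float,
                      k1: float, k2: float) -> float:
    """Lumped load: interconnect plus terms proportional to load and driver widths."""
    return c_int + k1 * w_load + k2 * w_drive


def rising_delay(t_in: float, c_int: float, w_load: float, w_drive: float,
                 k1: float, k2: float, c0: float, c1: float) -> float:
    """Rising-output delay of one gate as a function of load and input slew."""
    y = total_capacitance(c_int, w_load, w_drive, k1, k2) / (w_drive * t_in)
    return t_in * (c0 + c1 * y)


# Placeholder inputs in arbitrary, mutually consistent units.
example_delay = rising_delay(t_in=0.1, c_int=10.0, w_load=20.0, w_drive=10.0,
                             k1=2.0, k2=1.0, c0=0.5, c1=0.01)
print(example_delay)
```

In an optimizer, a function of this kind would be called once per gate along the critical path, with the output slew of one stage feeding the input slew of the next.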

The first experiments with both the CLA-4 and Ling adders raised several issues.
First, the algorithm has a tendency to choose widths that result in large fanouts.
Consequently, it was necessary to include a fanout constraint. Whenever the fanout for a certain
set of widths exceeds a chosen threshold, the overall delay is penalized. The program then
begins decreasing the width of the transistor in question. Second, there was a question about
the repeatability of results when the initial starting point is changed. By adding an option
to randomly generate the initial widths from a reasonable range, the overall optimized delay
was found to be consistent within 3% between runs even though the exact partitioning of
delay varied somewhat. Finally, there was a question about the accuracy of delays
calculated by this approach. A comparison was made between the prediction of the optimization
program and that obtained using HSPICE. For the worst case path through each adder, the
results are summarized in Table 4.1. In each case the specified maximum power was 400
mW. The error of less than 10% is quite good, but might be improved. In particular, the
coefficients were generated only up to a fanout of four. For larger fanouts, extrapolation
was utilized and the non-linear nature of the fitting made the error more significant for these
cases.

Table 4.1  Comparison of Path-Delay for Optimization Program and HSPICE

Adder Type    Optimization Program (ns)    HSPICE (ns)    Error

CLA-4         1.515                        1.659          9.5%
Ling          1.237                        1.341          8.4%
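
The error figures for Table 4.1 appear consistent with a relative difference taken against the optimization program's prediction. That definition is inferred from the CLA-4 entry rather than stated in the text, so the sketch below should be read as a plausibility check only.

```python
def relative_error_percent(predicted_ns: float, hspice_ns: float) -> float:
    """Difference between HSPICE and predicted delay, relative to the prediction."""
    return (hspice_ns - predicted_ns) / predicted_ns * 100.0


cla4_error = relative_error_percent(1.515, 1.659)   # CLA-4 row of Table 4.1
ling_error = relative_error_percent(1.237, 1.341)   # Ling row of Table 4.1

print(f"CLA-4: {cla4_error:.1f}%  Ling: {ling_error:.1f}%")
```

Both values fall under the 10% bound quoted in the text, and in both cases the optimization program underestimates the HSPICE delay.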

To meet a 4ns clock cycle, it is necessary to pipeline the adder. This is accomplished
by generating and latching the propagate, generate, and local group carry signals during the
first clock phase; other work external to the adder would also be performed during this
phase. To reduce levels of logic, part of the function that precedes the latch was merged
into the first stage of the latch. Figure 4.2 shows the merged NOR-latch which generates
the first stage of generate logic for the adder. When the clock changes state, a transition can
occur for only one of the two nodes A and B, which feed the cross-coupled pair. Other
approaches for merging logic with a latch are possible, but care must be taken to avoid those
that allow simultaneous transitions on both inputs of the latch pair when the clock changes.
This sort of clock hazard is dependent on the layout parasitics of the cell, which, if not
carefully controlled, can reduce the yield of this frequently used circuit.

Figure 4.2  Merged Nor-Latch-Buffer Cell

Since rounding is merged into the mantissa addition stage, it is necessary to
generate both A+B and A+B+1, requiring two copies of the latter part of the carry generation
tree. The sum logic and much of the initial carry generation logic is only implemented once.
A second carry chain adds approximately 30% to the overall transistor count of an adder.
Round to ±∞ requires either (A+B and A+B+1) or (A+B+1 and A+B+2); the latter can be
realized using an additional row of half-adders prior to the mantissa adder (discussed below). The
various functional units in the FPU require 5 different versions of the Ling-modified
adder, as summarized in Table 4.2.
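
The idea of producing A+B and A+B+1 together and letting the rounding decision choose between them can be sketched in a few lines. This is a simplified software illustration, not the hardware datapath: it assumes non-negative integer mantissas that are already aligned, a positive result, no renormalization after rounding, and guard and sticky bits supplied by the caller. The function name and mode strings are hypothetical.

```python
def rounded_sum(a: int, b: int, guard: int, sticky: int, mode: str = "nearest") -> int:
    """Select between a+b and a+b+1 according to the rounding decision."""
    s0 = a + b            # result of the first carry chain: A+B
    s1 = a + b + 1        # result of the second carry chain: A+B+1
    lsb = s0 & 1
    if mode == "nearest":             # round to nearest, ties to even
        round_up = bool(guard and (sticky or lsb))
    elif mode == "zero":              # truncate
        round_up = False
    elif mode == "up":                # toward +infinity, positive result assumed
        round_up = bool(guard or sticky)
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return s1 if round_up else s0


print(rounded_sum(0b1010, 0b0001, guard=1, sticky=0))   # tie on an odd sum rounds up
```

Since both sums are available at the same time, the rounding decision only controls a final selection and adds no second carry propagation to the critical path.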

A  method  that  detects  the  potential  for  a  long  carry  propagation  time  and  that  will 
cause  a  single  cycle  stall  in  order  to  complete  the  addition  was  investigated  [Wolrich84]. 
An  upper  bound  on  the  probability,  P,  of  generating  a  stall  has  been  derived  as: 


P  = 


-m  +  W  ^  ^(-n  +  w) 
n 


where:

w = width of addition
m = total bits not included in any detection gate (low order bits)
n = width of detection gates

Table 4.2  Ling Adder Implementations Used in FPU

Type of Adder                Delay (ns)   Area (µm x µm)   Transistor Count   Approximate Power (W)

11 bit                       1.01         666x909          2,264              0.328
11 bit, with 2 carry chains  1.02         914x922          2,998              0.418
32 bit                       1.3          673x2133         5,913              0.856
53 bit, 2 carry chains,      1.6          1445x5210        16,600             2.750
leading one logic
53 bit, 2 carry chains       1.5          1035x3793        13,627             1.998

This idea was not pursued further for several reasons. Long mantissa adders are
needed in the add, multiply, and divide units and adding the ability to stall would
complicate the design for each of these units. In addition, a more complicated and potentially
slower acquisition approach for the result busses would be needed, versus the simple
reserve-at-issue approach that was implemented. Finally, pipelining the 53-bit mantissa
adder removed this component from being a bottleneck along critical paths.

4.1.2  Add  Unit  Implementation 

The algorithm implemented for the add unit takes advantage of several well-known
characteristics of floating-point addition (refer to Table 4.3 and Figure 4.3) [Quach90],
[Quach91a], [Quach91b], [Quach91c]. For example, the alignment and normalization steps
needed for addition are mutually exclusive. If the two operands are both positive, or their
signs differ and an alignment shift of more than one is needed, the resulting normalization
shift will be less than or equal to one. Conversely, if the signs are different and an alignment
shift of less than or equal to one is needed, the normalization shift may be greater than one.
An alignment shifter is needed for the former case, while a simple mux can be used to
handle alignment for the latter case. Normalization also requires a shifter and a mux in order
to handle both paths. By implementing two mantissa adders, one for normalization and one
for alignment, one can reduce the latency of floating-point addition to only two 4ns clock
cycles, as shown in Table 4.4. However, the simulations discussed earlier for the overall
floating-point unit architecture show that reducing the add latency by one cycle has a
minimal effect on overall performance (approximately 1.5% improvement). Integration levels
and yields in GaAs suggest that the 16,000 transistors which comprise a 53-bit adder would
be better applied elsewhere.

Table 4.3  Floating-Point Addition Algorithm

Pipeline Stage    Operation

S0P1              Exponent compare
                  Mantissa swap
                  Exponent swap

S0P2              Alignment right shift
                  Sticky bit determination
                  Guard and round bit generation

S1P1              Gen/Prop/Carry for Ling adder
                  Rounding logic

S1P2              Mantissa add (Ling adder)
                  Leading one prediction

S2P1              Complement result
                  Normalization left shift
                  Exponent adjustment
                  Generate exception signals

Figure 4.3  Add Unit

Table 4.4 Two Cycle Add Unit

Pipeline Stage   Normalization Path       Alignment Path
1                Small alignment shift    Large alignment shift
                 Mantissa add
2                Normalization            Mantissa add
                                          Small normalization shift


As indicated in Table 4.3, three operations occur in the first pipeline stage of the
add unit. First, the exponents are subtracted to determine which operand is larger and the
amount of the alignment shift. The adder used here is 11 bits wide and supports 2 carry
chains in order to generate both A+B and A+B+1. The exact functions produced by this
adder are:

$$ExpAlign_0 = Exp_A - Exp_B$$
$$ExpAlign_1 = Exp_B - Exp_A$$

The amount of alignment shift needed can then be generated by:

if ($ExpAlign_0 \geq 0$) shAmt = $ExpAlign_0$

else shAmt = $ExpAlign_1$

The  mantissas  are  then  swapped  based  on  the  carry-out  of  the  ExpAlign  adder  in  order  to 
ensure  that  the  smaller  operand  is  aligned  to  the  larger  one. 
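
The exponent comparison can be sketched in Python as follows. This is an illustrative model rather than the circuit itself: it assumes that the second exponent is presented to the adder in complemented form, so that the two carry chains (A+B and A+B+1) yield both differences, and the names exp_align, EXP_BITS, and MASK are hypothetical.

```python
EXP_BITS = 11                    # width of the exponent adder
MASK = (1 << EXP_BITS) - 1

def exp_align(exp_a, exp_b):
    """Return (sh_amt, swap) for two biased exponents (assumed model)."""
    not_b = ~exp_b & MASK
    plus1 = exp_a + not_b + 1            # A+B+1 chain: exp_a - exp_b
    plus0 = exp_a + not_b                # A+B chain:   exp_a - exp_b - 1
    carry_out = plus1 >> EXP_BITS        # 1 when exp_a >= exp_b
    if carry_out:
        return plus1 & MASK, False       # operand B is aligned to A
    # The bitwise complement of the A+B chain equals exp_b - exp_a.
    return ~plus0 & MASK, True           # swap so that A is aligned to B
```

In this model the carry-out plays the role described above: it selects which difference becomes the shift amount and whether the mantissas are swapped.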

In the next stage, S0P2, the alignment shift occurs via a 55-bit shifter which also
generates the guard and round bits. This shifter is implemented as a cascade of two 8-input
super-buffer muxes, where the mux selects are simply the output of the preceding exponent
adder. Note that more area intensive squeeze muxes were used in order to minimize the gate
depth through the shifter, since several of the bits of the shifter are used along critical paths
for both this phase and the normalization phase. The two muxes of this shifter can shift by
(0, 1, 2, 3, 4, 5, 6, or 7) and (0, 8, 16, 24, 32, 40, 48, or 56) respectively; the shifter is
comprised of 4,989 transistors and returns a result in 500 ps. The sticky bit is also generated
during this phase through the use of a thermometer function. This logic works by creating a
vector in which the number of low-order bits that are set equals the alignment shift amount.
This vector is then ANDed with the mantissa being aligned; the result is a new vector
which represents those bits that are shifted off the low-order end during alignment. An OR
tree is used on this new vector to create a single result which indicates whether any of the
bits below the least-significant bit of the aligned mantissa are set (excluding the guard and
round bits, which are dealt with separately). This approach to generating the sticky bit is a
more efficient solution than the direct approach of using a 106-bit shifter. Because generation
of the sticky bit is not on a critical path, the logic to generate it was synthesized from
a behavioral description of the pertinent equations. Some of the initial logic needed for
rounding determination, including the generation of the guard and round bits, is also
produced in this pipeline stage. In verifying this design, I discovered several errors with the
approach described in [Quach91a]; the final equations for the guard and round bits are
listed in Appendix B.
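
A behavioral sketch of the two-stage shifter and the thermometer-based sticky bit may make the data flow easier to follow. It is a minimal Python model under stated assumptions, not the synthesized logic: the function names are hypothetical, and the guard and round bits are assumed to be the two most significant of the bits shifted off.

```python
def align_shift(mantissa, sh_amt):
    """Right shift by 0..63 using a fine stage (0..7) and a coarse stage (0, 8, ..., 56)."""
    fine = sh_amt & 0b000111
    coarse = sh_amt & 0b111000
    return (mantissa >> fine) >> coarse

def thermometer(sh_amt):
    """Vector whose low sh_amt bits are set."""
    return (1 << sh_amt) - 1

def sticky_bit(mantissa, sh_amt):
    """OR of the bits shifted off, excluding the guard and round positions."""
    lost = mantissa & thermometer(sh_amt)           # bits that fall off the low end
    below_round = thermometer(max(sh_amt - 2, 0))   # assumed: G and R are the top two lost bits
    return int((lost & below_round) != 0)
```

Because the mask depends only on the shift amount, the sticky bit can be formed from the unshifted mantissa, which is why no 106-bit shifter is required.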

In the next stage, S1P1, the local group-generate signals for the carry tree and the
initial XOR results for the sum are produced and latched. As described earlier, this acts to
split the operation of the mantissa adder across two clock phases. The inputs to this stage of
the adder are the aligned operands or one of several transformed variations of the operands.
The most common of these is the complement of one of the operands when the operation
being performed is subtraction. A somewhat more complex case involves the two
round-to-infinity modes (round-to-minus-infinity, RM, and round-to-plus-infinity, RP), which
require the mantissa adder to produce three sum results: A+B, A+B+1, and A+B+2. To
understand why this is necessary, consider adding the following two mantissas under the RP
mode (this example is abbreviated to 5 bits for clarity):

  11000 0   operand A, whose exponent is one greater than that of operand B
  01000 1   operand B, which has been shifted right one bit for alignment
 --------
 100000 1   right-most bits are the lsb (L) and the guard bit (G); the sticky bit (S) is zero

The intermediate result needs to be shifted right by one for normalization and then
rounded up by one to satisfy the RP mode. However, rounding has been merged into the
mantissa addition phase to reduce latency and to allow the use of only one 53-bit adder. This
means that to round properly, a one must be added to the bit position just above the lsb,
instead of at the lsb position (anticipating the subsequent normalization). In other words,
the adder must be able to produce A+B+2. Interestingly, this is not needed for the
round-to-nearest (RN) mode, since in this case the rounding one is added to the lsb only when
L=1, G=1, S=0; consequently, adding a one at the lsb position will result in a propagation
which does not depend on whether a right shift occurs. This is the tie case for RN (i.e., in
decimal, 5.5 would be considered a tie, which for a similar binary example would be rounded
up only if the lsb is set); for the situation where L=1, G=1, S=1 and a 1-bit normalization
shift is required, a rounding one added at the lsb will suffice due to propagation. The question
arises of how to generate the three different sums for RP without adding yet another carry tree.
To generate A+B+2, a row of half-adders is inserted into the pipeline prior to the generate/
sum stage of the adder, as shown in Figure 4.4. The A+B+1 result can be created simply by
setting the lsb of the adder output to one, since this case will only be needed when the lsb
is zero.
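
The 5-bit example can be checked numerically. The sketch below is an illustration under stated assumptions (a positive result, so RP rounds up whenever any discarded bit is set; all variable names are hypothetical); it compares the conventional normalize-then-round result against the three candidate sums followed by the 1-bit right shift.

```python
# Operands of the abbreviated 5-bit example, with their guard bits.
a, guard_a = 0b11000, 0
b, guard_b = 0b01000, 1

# Reference: exact sum including the guard bit, then normalize, then round (RP).
exact = ((a << 1) | guard_a) + ((b << 1) | guard_b)   # 0b1000001
kept = exact >> 2                 # 1-bit right shift plus removal of the guard bit
discarded = exact & 0b11          # old lsb and guard bit
reference = kept + (1 if discarded else 0)

# Merged rounding: each candidate adder output followed by the 1-bit right shift.
candidates = {k: (a + b + k) >> 1 for k in (0, 1, 2)}
```

Only the A+B+2 candidate reproduces the reference mantissa, because the increment must land one position above the pre-normalization lsb.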

An optimization that has been alluded to is to round during mantissa addition
instead of during a separate step. The rounding logic determines which of the adder outputs
to select and is derived during both this stage and the subsequent S1P2 stage. Much of this
logic has been implemented by hand since it is on a critical path for this phase. Several
signals are derived from 64-entry truth tables which were optimized using 6-input
Karnaugh maps. The mux-reduction technique described in Section 5.9 was used extensively
to factor out late-arriving signals, such as the carry-out of the mantissa adder. A number of
bugs were identified during verification and led to modifications of the equations derived
in [Quach91a]; the final equations are summarized in Appendix B.


Figure 4.4 Generating A+B+2 for RM/RP Rounding Modes


During the next stage, S1P2, the size of a normalization shift is determined in
parallel with the mantissa addition. While leading one prediction (LOP) requires additional
logic, it does not need to await the completion of the addition. As mentioned, during the
alignment phase the two exponents are compared, and the mantissas are swapped if
required to ensure that the result will be positive. However, if the exponents are equal, the
result may still be negative and the LOP logic must be able to detect either a leading one
(positive) or a leading zero (negative) in the result. An alternative would be to perform a
magnitude comparison and swap the mantissas during the alignment stage. The first
approach results in slightly more complicated LOP logic, while the second requires an
additional alignment shifter since the magnitude comparison would be done concurrently with
the actual alignment; to minimize area, the former approach was chosen.

The LOP logic was initially based on [Quach91b] and can be broken down into
three stages. First, a vector Ubar is generated from the two inputs to the adder. The first
occurrence of a zero in Ubar (from the most significant bit) indicates the position of the
leading one/zero. The equations that define this vector represent all cases that have the ability
to generate a leading one/zero from the two inputs. In this discussion, the following
conventions are assumed:

Ai and Bi are the ith bits of the input operands A and B

Ti = XOR(Ai,Bi)

Zi = NOR(Ai,Bi)

Gi = AND(Ai,Bi)

The  expression  that  was  originally  used  is  equation  (4)  from  [Quach91b]: 

V,  =  h- 1  +  7.- ,  ( (r,.2  ©  A.. ,)  c,) 

However, random verification found a number of cases not covered by this equation, and a
new relationship for $Ubar_i$ was derived as the OR of the following terms:


11? 


C-TZ., 

Z^.  iG7,_ 

1 

2,-.7,Z,. 

^,*,7,7,., 

C,>,Z,7,-i 

2,.,G,7,- 

These  16  equations  contain  a  certain  degree  of  redundancy  and  can  be  reduced  to  an  OR  of 
the  following  seven: 


2,.,  7-7,-, 


1 


z.^i  (G.ec,_,) 


These  reductions  make  use  of  certain  characteristics  of  T,  Z,  and  G: 

$$\bar{G}_i = T_i + Z_i$$
$$\bar{Z}_i = T_i + G_i$$
$$\bar{T}_i = Z_i + G_i$$

In  addition,  the  most-significant  bit  of  Ubar  is  generated  by  the  following  relationship: 

Ubar^2  -  ^52^51^50 ■'■^52^51^50 ■*■^52^51^50 ■'■^527^51^50 ■*■^527^51^50 

The  cell  used  to  generate  the  Ubar  vector  consists  of  20  gates  and  9  inputs.  This 
large  number  of  inputs  results  in  the  cell  being  wire  limited  and  the  effective  height  of  the 
cell  (including  spillover  routing  from  the  overcell  region)  exceeds  all  other  datapath  cells 
by  15  microns.  As  a  result,  this  cell  sets  the  datapath  pitch  for  one-half  of  the  FPU,  adding 
an  additional  6%  to  the  overall  area  of  the  chip.  Additional  design  effort  devoted  to  this  cell 
should  be  able  to  resolve  this  problem. 
Ubar is converted, through the use of a parallel CLA-like tree, into a second vector,
“SH”, which contains a single one at the position of the leading one/zero. This conversion
is reflected in the following equation:

$$SH_i = \overline{Ubar_i} \cdot \bigwedge_{j > i} Ubar_j$$

The parallel look-ahead reduction is shown in Figure 4.5 and the variables are defined as:

Glop0 = AND(Ubar53,Ubar52,Ubar51)
...
Glop17 = AND(Ubar2,Ubar1,Ubar0)

GBlop0 = AND(Glop0,Glop1)
GBlop1 = AND(Glop0,Glop1,Glop2)
...
GBlop10 = AND(Glop15,Glop16)
GBlop11 = AND(Glop15,Glop16,Glop17)

GGlop0 = AND(GBlop1,GBlop3)
GGlop1 = AND(GBlop1,GBlop3,GBlop5)
GGlop2 = AND(GBlop1,GBlop3,GBlop5,GBlop7)
GGlop3 = AND(GBlop1,GBlop3,GBlop5,GBlop7,GBlop9)

Figure 4.5 Generation of SH Vector for LOP

This one-of-53 SH vector is reduced to the 6-bit encoded shift amount (Elop) that
controls the normalization shifter. The defining equations are:

Elop(0) = OR(SH{1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51})

Elop(1) = OR(SH{2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51})

Elop(2) = OR(SH{4,5,6,7,12,13,14,15,20,21,22,23,28,29,30,31,36,37,38,39,44,45,46,47,52})

Elop(3) = OR(SH{8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31,40,41,42,43,44,45,46,47})

Elop(4) = OR(SH{16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,48,49,50,51,52})

Elop(5) = OR(SH{32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52})


The  Elop  vector  is  used  to  predict  the  leading-one  position  for  normalization,  but  it 
will  be  accurate  only  to  within  one  bit  position.  In  other  words,  it  may  be  necessary  to  per¬ 
form  a  1-bit  fine  adjustment  shift  using  a  multiplexor. 
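
The index sets in the Elop equations follow a simple rule: SH position i contributes to Elop(k) exactly when bit k of i is set. A short Python sketch (hypothetical function names, a behavioral model only) regenerates the sets and encodes a one-hot SH vector by OR-reduction.

```python
def elop_index_sets(width=53, bits=6):
    """For each Elop bit k, the SH positions whose index has bit k set."""
    return [{i for i in range(width) if (i >> k) & 1} for k in range(bits)]

def encode(sh):
    """Encode a one-hot SH vector (given as an int) into the shift amount Elop."""
    elop = 0
    for k, positions in enumerate(elop_index_sets()):
        # OR-reduce the selected SH bits, as in the Elop(k) equations.
        if any((sh >> i) & 1 for i in positions):
            elop |= 1 << k
    return elop
```

For a one-hot input with its single one at position p, encode returns p, which is the amount applied by the coarse normalization shifter before any 1-bit fine adjustment.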

Several actions occur during the last stage, S2P1. First, if no alignment was
necessary, the output of the adder may be negative, since no magnitude comparison was
performed on the mantissas. Consequently, this intermediate mantissa result may need to be
complemented. At this point, any large normalization shift is performed using the value of
Elop generated from the leading-one prediction logic. This normalization shifter is similar
in organization to the alignment shifter, using a cascade of two 8-input multiplexors. The
final mantissa selection depends on several things, including whether the alignment or
normalization path was selected, the decision of the rounding logic, and whether an additional
1-bit normalization is needed. The last condition, derived by looking at the most-significant
bit of the normalization shifter, sets the effective logic depth along this path at 20 gates. The
term “effective” is used here because the actual gate depth is less than 20, but one of the
gates must drive a signal across a 53-bit column of the fine adjustment multiplexor. In all
other parts of the FPU, multiplexor select signals are derived during the phase previous to
when they are actually used. A buffered gate can drive the capacitance and fanout
encountered for a 53-bit column in 300 to 400 ps, or approximately 3 to 4 gate delays. The
critical path just described could be reduced by adding logic in parallel to the mantissa adder
in order to determine whether a fine adjustment shift will be needed.

Also during this phase, the exponent (Ef) is adjusted to reflect the outcome of any
normalization. The various possibilities are summarized in Table 4.5. The actions indicated
in this table are implemented using an 11-bit exponent adder which has two carry chains
and can produce A+B and A+B+1. A number of the signals which generate the offset for
this adder and which determine the result that is selected arrive late in the phase. To
mitigate the effect of late-arriving signals, this logic was hand-generated like the rounding
logic.

Finally, information about exceptions and the form of the result is generated during
this phase. Several bits encode whether the result is infinite, not-a-number (NaN), or zero.
These bits, used during the reorder buffer write phase, select either the computed result
from the add unit or one of the three constants. The same circuitry is used for the other
functional units. Applying these bits in the reorder buffer write phase works well because
the circuitry used in the final phase of most functional units is already at the critical logic
depth and cannot accommodate the additional selection logic.


Table 4.5 Generation of Final Exponent

Normalization Class               Ef
Many Left Shift (MLS)             Efi - Elop
Many Left Shift and Fine Adjust   Efi - Elop - 1
One Left Shift (OLS)              Efi - 1
No Shift (NXS)                    Efi
One Right Shift (ORS)             Efi + 1
4.1.3 Comparison Instructions

The MIPS ISA specifies a broad range of comparisons for floating-point data via
three possible conditions: unordered (either operand is a NaN), equal (the result mantissa
is zero), and less than (the result is negative). The various predicates are encoded into the
lower 3 bits of the instruction and are implemented by subtracting one of the specified
operands from the other. Instead of producing a data result, they generate a single bit which
indicates the outcome of the comparison; this bit is written into the floating-point status
register via the reorder buffer. In addition, this bit of the status register is exported from the
FPU, along with a condition-valid signal, to be used by branch-on-FPU instructions.
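
The way three condition flags combine into a single result bit can be sketched as follows. This is a behavioral Python illustration, not the add unit datapath: it assumes the usual MIPS bit assignment for the 3-bit predicate field (bit 0 selects unordered, bit 1 equal, bit 2 less than), and it derives the flags with host comparisons rather than with the subtraction described above. The name compare and the PREDICATES list are introduced here for illustration.

```python
import math

# Assumed mnemonics for predicate codes 0..7 (non-signalling forms).
PREDICATES = ["f", "un", "eq", "ueq", "olt", "ult", "ole", "ule"]

def compare(cond, a, b):
    """Return the condition bit for a 3-bit predicate code and two operands."""
    unordered = math.isnan(a) or math.isnan(b)
    equal = (not unordered) and a == b
    less = (not unordered) and a < b
    # Each predicate bit enables one of the three conditions.
    return bool(((cond >> 2) & 1 and less)
                or ((cond >> 1) & 1 and equal)
                or (cond & 1 and unordered))
```

For example, code 6 ("ole") enables both the less-than and equal conditions, so it is true for ordered operands with a <= b and false whenever either operand is a NaN.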

4.1.4  Functional  Verification 

The add unit design was fed random floating-point numbers in each of the three
main operating regimes: alignment equal to zero, equal to one, and greater than one. The
same random operand pairs were also fed to a verification program running on the host
workstation and the results of both were compared. In this way the results of several
hundred thousand computations were verified, with a performance of 1.5 calculations per
second. At this point, a compiled-code version of the add unit was used to improve the
simulation time to 30 calculations per second and more than 10 million new calculations
were performed.
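
Operand generation for the three regimes can be sketched as below. This is a hypothetical Python harness fragment, not the original verification program: the function names and the upper limit of 60 on the exponent difference are assumptions, and only normalized double-precision operands are produced.

```python
import random
import struct

def make_double(sign, exp, frac):
    """Assemble an IEEE-754 double from its sign, biased exponent, and fraction fields."""
    bits = (sign << 63) | (exp << 52) | frac
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def random_pair(regime, rng):
    """Return two normalized doubles whose exponents differ by 0, 1, or more than 1."""
    if regime == "zero":
        diff = 0
    elif regime == "one":
        diff = 1
    else:
        diff = rng.randint(2, 60)           # assumed upper limit for the large-shift regime
    exp_b = rng.randint(1, 2046 - diff)     # biased exponents 1..2046 are normalized
    exp_a = exp_b + diff
    a = make_double(rng.getrandbits(1), exp_a, rng.getrandbits(52))
    b = make_double(rng.getrandbits(1), exp_b, rng.getrandbits(52))
    return a, b
```

Each generated pair would be applied both to the simulated add unit and to the host's floating-point addition, with the two results compared bit for bit.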

4.2  Conversion  Unit 

The MIPS ISA supports conversions between any of the 3 number formats (integer/
word, IEEE-754 single, and IEEE-754 double precision floating-point), which result in 6
possible operations:

cvt.d.s    single to double
cvt.d.w    word to double
cvt.s.d    double to single
cvt.s.w    word to single
cvt.w.s    single to word
cvt.w.d    double to word

A commonly encountered sequence of instructions involves changing the rounding
mode and then converting an operand to the integer format. Like the MIPS R4000, the
Aurora III FPU adds the following instructions, which encode the rounding mode directly into
the instruction to eliminate the need to change the control register before and after the
conversion:


ceil.w.s,  ceil.w.d     same as cvt.w but for the round-to-plus-infinity (RP) mode

floor.w.s, floor.w.d    same as cvt.w but for the round-to-minus-infinity (RM) mode

round.w.s, round.w.d    same as cvt.w but for the round-to-nearest (RN) mode

trunc.w.s, trunc.w.d    same as cvt.w but for the round-to-zero (RZ) mode

Initially, we intended to implement conversions in the add unit to save logic.
However, many of the paths in the optimized add unit are already at the limit of 20 gates per
phase, especially in the rounding logic. After further investigating the operations and logic
needed to support conversions, it became clear that merging the two sets of instructions into
the add unit would adversely impact the unit’s speed. The final design for the conversion
unit includes a modest 30,000 transistors.

For the discussion which follows, refer to Figure 4.6, which shows the block
diagram of the conversion unit. The organization of the unit evolved from first identifying the
operations and exceptions associated with each of the conversion types. The following
summarizes these actions, describing first the steps involved in the conversion and then the
possible exceptions that may result:

Figure 4.6 Conversion Unit

cvt.d.s

1. Shift mantissa left from 24 bits to 53 bits. Since floating-point operands are
   normalized to 1.XX, this shift, performed using a multiplexor, will always be by a constant
   number of bits.

2. Re-bias the exponent: $e_d = e_s + 896_{10}$, where $e_s$ and $e_d$ refer to the 8-bit and
   11-bit exponents for single and double precision. The subscript “10” for “896” means that
   this number is of base 10; some numbers for the discussion that follows may be
   represented in hexadecimal, or base 16. The constant “896” results from subtracting the
   bias for a single precision number (“127”) from the bias for a double precision
   number (“1023”).

Invalid:         Operand is a signalling NaN and flag is enabled. A quiet NaN operand
                 will produce a quiet NaN result if this flag is not enabled.

Inexact:         Cannot occur, since no rounding is needed.

Underflow:       Cannot occur, since the range and precision of the double precision
                 format is larger than that for single precision numbers.

Unimplemented:   1. Operand is NaN and Invalid flag is disabled.
                 2. Source operand is a denormal.
                 3. Instruction attempts conversion from double to double, which is
                    not allowed.

Overflow:        Cannot occur, for same reason as Underflow.


cvt.d.w

1. Normalize mantissa by shifting left until the msb is a one. This will be done with
   leading-one prediction logic and a left shifter.

2. If the two’s-complement operand is negative, complement the normalized result in
   order to obtain a sign-magnitude representation.

3. Generate A+1 (’A’ being the mantissa result from step 2) in order to complete the
   two’s-complement inversion of step 2.

4a. Select between A and A+1 depending on the sign of the input operand.

 b. Generate the result exponent: $32_{10} - (shAmt + 1) + 1023_{10} = -shAmt + 1054_{10}$

Invalid:         Cannot occur, since there is no representation for a signalling NaN
                 in the word format (although a quiet NaN is possible).

Inexact:         Cannot occur, since no rounding is needed.

Underflow:       Cannot occur.

Unimplemented:   1. Operand is NaN and Invalid flag is disabled.
                 2. Instruction attempts conversion from word to word, which is not
                    allowed.

Overflow:        Cannot occur.


cvt.s.d

1. Derive the guard, round, and sticky bits.

2. Add a rounding bit if necessary.

3a. Adjust exponent: $e_s = e_d - 896_{10}$

 b. Normalize mantissa by one and add one to exponent, if necessary.

Invalid:         Operand is a signalling NaN and flag is enabled.

Inexact:         1. If guard, round, or sticky bits are set. In other words, the result is
                    not the same as an infinitely precise result.
                 2. If an overflow has occurred.

Underflow:       If enabled and tininess (result falls into the denormalized range), or
                 if disabled, tininess, and loss of accuracy (guard, round, or sticky
                 bits are set).

Unimplemented:   1. Source operand is a signalling NaN and Invalid exception is not
                    enabled.
                 2. Source is denormalized.
                 3. Instruction attempts conversion from single to single, which is not
                    allowed.
                 4. Underflow has occurred.

Overflow:        If overflow occurs. Care must be taken to examine the result after a
                 possible 1-bit normalization is performed.

cvt.s.w

1. Normalize mantissa, using leading-one logic and left shifter.

2a. If the operand is negative, complement the normalized result.

 b. Generate guard, round, and sticky bits.

3. Produce A, A+1 using 32-bit mantissa adder.

4a. Depending on sign of operand, rounding mode, and guard/round/sticky bits, select
    between the results of step 3.

 b. Generate the exponent: $-(shAmt + 1) + 32_{10} + 127_{10} = -shAmt + 158_{10}$.

5. Normalize mantissa by one and adjust exponent, if necessary.

Invalid:         Cannot occur, since there is no signalling NaN for the word format.

Inexact:         If guard, round, or sticky bits are set.

Underflow:       Cannot occur, since the range of the single format is large enough to
                 represent the magnitude of any 32-bit two’s-complement integer.

Unimplemented:   1. If a source operand is a signalling NaN and the Invalid exception is
                    not enabled.
                 2. Instruction attempts to convert from single to single, which is not
                    allowed.

Overflow:        Cannot occur.


cvt.w.d

1a. Shift mantissa right by $-((-1023_{10} + e_d) + 1) + 32_{10} = -e_d + 1054_{10}$. If the
    amount of the shift is less than zero, then an overflow has occurred. If the amount of the
    shift is greater than 31, then the shift amount should be set to 31; this means that the
    operand is less than 1.0.

 b. Derive the guard, round, and sticky bits.

2. If the operand is negative, complement the result from step 1.

3. Generate A, A+1 using a 32-bit adder. The incremented version will be used for
   adding either a complementing or a rounding one. It is necessary to add one or the other
   but not both, as will be discussed below. Select between the two results based on the
   sign of the input operand, the rounding mode, and the guard/round/sticky bits.

Invalid:         1. Source operand is infinity.
                 2. Source operand is a signalling NaN and this exception is enabled.
                 3. An overflow has occurred.

Inexact:         If any of the guard, round, or sticky bits are set.

Underflow:       Cannot occur.

Unimplemented:   1. Source operand is a signalling NaN and the Invalid exception is not
                    enabled.
                 2. The operand is denormalized.
                 3. Instruction attempts to convert from word to word, which is not
                    allowed.
                 4. Underflow has occurred.

Overflow:        If an overflow has occurred.


cvt.w.s

1a. Shift mantissa right by $-((-127_{10} + e_s) + 1) + 32_{10} = -e_s + 158_{10}$. If the
    amount of the shift is less than zero, then an overflow has occurred. If the amount of the
    shift is greater than 31, then the shift amount should be set to 31; this means that the
    operand is less than 1.0.

 b. Derive the guard, round, and sticky bits.

2. If the operand is negative, complement the result from step 1.

3. Generate A, A+1 using a 32-bit adder. The incremented version will be used for
   adding either a complementing or a rounding one. It is necessary to add one or the
   other but not both, as will be discussed below. Select between the two results based
   on the sign of the input operand, the rounding mode, and the guard/round/sticky bits.

Invalid:         1. Source operand is infinity.
                 2. Source operand is a signalling NaN and this exception is enabled.
                 3. An overflow has occurred.

Inexact:         If any of the guard, round, or sticky bits are set.

Underflow:       Cannot occur.

Unimplemented:   1. Source operand is a signalling NaN and the Invalid exception is not
                    enabled.
                 2. The operand is denormalized.
                 3. Instruction attempts to convert from word to word, which is not
                    allowed.
                 4. Underflow has occurred.

Overflow:        If an overflow has occurred.
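
The exponent and shift-amount arithmetic in the summaries above reduces to a few constants, which the following Python sketch makes explicit. It is illustrative only: the function names are hypothetical, and returning None for the shift amount on overflow is a convention chosen for this sketch.

```python
SINGLE_BIAS, DOUBLE_BIAS, WORD_BITS = 127, 1023, 32

def rebias_single_to_double(e_s):
    """cvt.d.s: add the difference of the two biases (896)."""
    return e_s + (DOUBLE_BIAS - SINGLE_BIAS)

def int_to_float_exponent(sh_amt, bias):
    """cvt.d.w / cvt.s.w: biased exponent after a left normalization by sh_amt."""
    return WORD_BITS - (sh_amt + 1) + bias

def float_to_int_shift(e, bias):
    """cvt.w.d / cvt.w.s: return (right shift amount, overflow flag)."""
    shift = -((e - bias) + 1) + WORD_BITS
    if shift < 0:
        return None, True            # magnitude too large for a 32-bit word
    return min(shift, 31), False     # clamp: operand magnitude is below 1.0
```

With bias 1023 the two word-related expressions simplify to 1054 - shAmt and 1054 - e, and with bias 127 to 158 - shAmt and 158 - e, matching the constants given in the steps above.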


Several observations can be made about the organization of Figure 4.6. First, all
conversion types have a latency of only 2 cycles. Architectural simulations have shown that
conversions are used infrequently and have little impact on overall performance, so a short
latency may not be of tremendous importance. However, a uniform latency does simplify
various parts of the issue logic, including reserving a result bus. Second, rounding will
never require adding both complementing and rounding ones to the mantissa. The reason is
related to the nature of rounding. Rounding is necessary whenever there are too many bits of
precision in an intermediate result to fit into the width of a format. In other words, for a
normalized number, there are bits “hanging off the end.” In order for a complementing one
to propagate from these extra bits up into the final section of the mantissa, these additional
bits must be all zero prior to the complementation. This means the guard, round, and sticky
bits are all zero and it is not necessary to round the result. Thus, only A and A+1 need to be
generated but never A+2, simplifying the design of the mantissa adder. This observation is
also used to simplify the rounding logic in the add unit.
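
The claim that A+2 is never needed can be checked exhaustively on a small model. The Python sketch below is an illustration under stated assumptions, not the conversion unit logic: a negative operand is modeled as an integer part i with a few fraction bits f, A is taken to be the bitwise complement of i, and the four rounding modes are modeled with host functions on exact fractions. All names are hypothetical.

```python
import math
from fractions import Fraction

# Rounding modes modeled with exact host arithmetic; round() on a Fraction
# rounds ties to even, which stands in for round-to-nearest here.
ROUNDERS = {"RN": round, "RZ": math.trunc, "RP": math.ceil, "RM": math.floor}

def increments(int_bits=6, frac_bits=3):
    """Collect every increment needed on top of the complemented integer part."""
    seen = set()
    for i in range(1 << int_bits):
        for f in range(1 << frac_bits):
            x = -(i + Fraction(f, 1 << frac_bits))   # negative operand
            for rnd in ROUNDERS.values():
                target = rnd(x)                      # correctly rounded integer
                seen.add(target - (~i))              # ~i is the complemented integer part
    return seen
```

In this model the only increments that ever occur are 0 and 1, which is consistent with an adder that produces just A and A+1.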

The preceding summary of conversion types shows that several instructions may
require a 1-bit normalization shift after the mantissa adder. This normalization step, which is
implemented with a multiplexor, is on a critical path. Waiting for the adder result before
generating the select signals for the multiplexor causes unacceptable delay. The two
reasons for adding a one to the mantissa are, as just discussed, complementation ($c_c$) and
rounding ($c_r$). A $c_c$ is used for both instructions that convert from an integer operand to a
single or double precision result. These conversions will never require a 1-bit normalization
shift, since there is no way for a one to propagate from the most significant bit of the adder.
In order to do so, the operand after complementation would need to be all ones; this
corresponds to an input operand of zero, which would not need to be complemented in the first
place. On the other hand, a $c_r$ can propagate out of the msb. Consider the following three
simplified cases:


      A        B        C

    100      110      111
     +1       +1       +1
    ---      ---     ----
    101      111     1000

The results of cases A and B do not need to be normalized, but that of C does. Because
case C arises only when all bits of the input to the adder are set, this condition is easily
detected in parallel to the actual addition, reducing the logic along the critical normalization
path. The cvt.d.w instruction will never need this normalization since it can add a
complementation one ($c_c$), but not a rounding one ($c_r$). The cvt.s.d and cvt.s.w instructions both
utilize a rounding one and benefit from this approach.
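
The equivalence between the adder carry-out and the all-ones detect can be illustrated with a short sketch. The function names and the widths used are assumptions chosen for illustration only.

```python
def needs_shift_from_adder(mantissa, width):
    """Reference: add the rounding one and inspect the carry out of the msb."""
    return ((mantissa + 1) >> width) == 1


def needs_shift_from_detect(mantissa, width):
    """Parallel detect: the carry out occurs only when every input bit is set."""
    return mantissa == (1 << width) - 1


def detect_matches_adder(width):
    """Compare both methods over every input of the given width."""
    return all(
        needs_shift_from_adder(m, width) == needs_shift_from_detect(m, width)
        for m in range(1 << width)
    )
```

Because the detect depends only on the adder input, it can be evaluated while the addition is still in progress, which is the point of the optimization described above.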

Leading-one detection logic is somewhat simpler for the conversion unit than for
the add unit, since there is only one input operand. The top bit of the Ubar vector
is zero. For lower-order bits, the logic looks at 3 bits simultaneously (although the third bit
will be shown to be unnecessary) and creates a vector “Ubar”, where the first bit set from
the left indicates the position of the leading one. The various possibilities are shown in
Table 4.6.

The defining equation is: $\overline{\overline{(A_{i+1} + A_i)} + \overline{(\overline{A_{i+1}} + \overline{A_i})}} = Ubar_i$, which is simply an XOR
of $A_{i+1}$ and $A_i$. The Ubar vector, which may contain several logic-ones, is translated into a
vector, “SH”, which has only one bit set at the position of the leading-one. This in turn is
translated into a 5 bit binary encoding which is used to control the normalization shifter. As
in the add unit, the SH vector is generated using a parallel look-ahead approach in order to
reduce the gate-depth of this logic and meet the target of 20 gates per phase. The equation
which defines SH is: $SH_i = \mathrm{AND}(Ubar_i, \overline{Ubar_j})$ for all $j > i$. The final version of this design was
verified using random operands with more than 10 million vectors.
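
A behavioral sketch of the Ubar and SH vectors follows. It is an illustration only: the sequential scan stands in for the parallel look-ahead logic, the function names are hypothetical, and the operand is assumed to be a two's complement integer of the given width.

```python
def ubar_vector(a, width):
    """Ubar[i] = A[i+1] XOR A[i]; the top bit is fixed at zero."""
    bits = [(a >> i) & 1 for i in range(width)]
    ubar = [0] * width
    for i in range(width - 1):
        ubar[i] = bits[i + 1] ^ bits[i]
    return ubar


def sh_vector(ubar):
    """One-hot vector marking the first set bit of Ubar, scanning from the left."""
    sh = [0] * len(ubar)
    higher_bit_seen = 0
    for i in reversed(range(len(ubar))):
        sh[i] = ubar[i] & (1 - higher_bit_seen)
        higher_bit_seen |= ubar[i]
    return sh


def predicted_leading_one(a, width=32):
    """Bit position selected by SH, or None when Ubar has no bit set."""
    sh = sh_vector(ubar_vector(a & ((1 << width) - 1), width))
    return sh.index(1) if 1 in sh else None
```

For positive operands the prediction is exact. For negative operands it can fall one position short of the leading one of the magnitude (for example, an operand of -16), which is why a 1-bit normalization shift may still be needed afterwards.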

Table 4.6 Leading One Prediction for the Conversion Unit

  3 bit window             Ubar_i   Comment
  A_{i+1} A_i A_{i-1}

  000                      0
  001                      0        since this will give a shift of 1 too few
  010                      1
  011                      1
  100                      1        since a negative number will also need a normalization shift
  101                      1
  110                      0        since this will give a shift of 1 too few
  111                      0

4.3 Multiply Unit

Conventional approaches to multiplication involve a partial product array (3-2 or 4-2
carry-save adders) followed by a carry-propagate mantissa adder. 4-2 carry-save adders
create a more area-efficient layout than 3-2 adders by providing both a regular placement
of cells and simpler routing. Booth recoding of the input operands can be used to reduce the
number of levels in the array by one, at the expense of adding recoding muxes. A comparison
of Booth and non-Booth approaches shows that the transistor count is lower for the
Booth recoded version, but the area is greater due to the fact that the routing is more
complicated (refer to Table 4.7). Increases in capacitance along critical paths of a Booth recoded
multiplier can offset the reduction in logic depth. This conclusion was reached by
creating a structural description for each design. However, instead of fully implementing
the designs, we used bounding boxes to represent the datapath leaf cells. This approach
enabled a fast turn-around for placement, routing, and timing analysis. Only after evaluating
the different designs did we design the cells for the iterative non-Booth multiplier that was
finally implemented. For comparison, the fastest CMOS double precision multiplier to date
generates a result in 8.8ns [Makino93].

The other alternative evaluated was the iterative use of a smaller array. This approach
reduces the size of the multiplier considerably, although it requires 5 cycles to produce
a result. Such a multiplier is blocking, in the sense that additional multiply instructions
must wait for the currently executing instruction to complete. A block diagram of this
non-pipelined implementation, which was used in the Aurora III FPU, is shown in Figure 4.7.
As discussed earlier, the 2-cycle pipelined multiply unit results in a 10% improvement in
performance, but it is too costly in terms of area given the integration constraints for GaAs;
the faster unit would have accounted for almost a third of the overall chip area. Integer
multiplication is also performed in the multiply unit. The final version of this design was
verified using random operands with more than 10 million vectors. The design and analysis of
the multiplication unit was done by Mike Riepe and is described further in [Riepe93],
[Riepe94].

Table 4.7 Multiplier Implementations

  Design                       Area (mm^2)   Transistors   Transistors per   Delay (ns)
                                                           square mm

  (4-2) Non-Booth              32.41         138,565       4,275             7.70 (2 cycles)
  (4-2) Booth                  39.75         118,432       2,979             7.67 (2 cycles)
  (4-2) Iterative, Non-Booth   24.19         94,501        3,907             20.0 (5 cycles)

Figure 4.7 5 Cycle Iterative Partial-Array Multiply Unit
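
The idea of reusing a smaller array over several passes can be sketched behaviorally. The slice width of 11 bits per pass (5 passes covering a 53-bit mantissa) is an assumption made purely for illustration; the text states only that the unit needs 5 cycles, and the function name is hypothetical.

```python
def iterative_multiply(a, b, passes=5, slice_bits=11):
    """Accumulate a*b by pushing one slice of b through a small array per pass."""
    assert 0 <= b < 1 << (passes * slice_bits)
    mask = (1 << slice_bits) - 1
    accumulator = 0
    for p in range(passes):
        b_slice = (b >> (p * slice_bits)) & mask
        partial = a * b_slice                  # stands in for the partial-product array
        accumulator += partial << (p * slice_bits)
    return accumulator
```

The loop makes the blocking behavior visible: the array and the accumulator are occupied for every pass, so a second multiply cannot start until the first has finished.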


4.4  Divide  Unit 

As  described  in  Section  3.8.1,  non-restoring  division  algorithms  can  be  combined 
with  the  representation  of  intermediate  results  in  a  higher  radix  redundant  form.  SRT-2, 
SRT-4, and SRT-8 approaches generate 1, 2, and 3 bits per cycle, respectively, and latencies
vary from 20 to 50+ cycles. Additional techniques can be employed to perform on-the-fly
conversion from redundant form to sign-magnitude form and on-the-fly rounding of the
result.  A  square  root  instruction  can  be  fairly  easily  mapped  to  the  same  hardware  used  for 
division,  with  little  increase  in  area.  The  design  and  analysis  of  the  divide  unit  was  done  by 
Dave  Putti  and  is  described  further  in  [Putti93].  The  divide  unit  was  not  included  in  the  cur¬ 
rent  implementation  of  the  FPU  due  to  area  limitations;  future  improvements  in  process 
technology  should  allow  the  divide  unit  to  be  added  to  the  design. 
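
For reference, the basic non-restoring recurrence that these SRT variants build on can be sketched as follows. This is the textbook radix-2 integer form, which retires one quotient bit per step; it is not the redundant-digit design of [Putti93], and the function name is hypothetical.

```python
def nonrestoring_divide(dividend, divisor, bits):
    """Radix-2 non-restoring division of unsigned integers, one bit per step."""
    assert divisor > 0 and 0 <= dividend < (1 << bits)
    remainder = 0
    quotient = 0
    for i in reversed(range(bits)):
        next_bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            # A negative remainder is not restored; the divisor is added instead.
            remainder = 2 * remainder + next_bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor                   # single correction step at the end
    return quotient, remainder
```

Higher-radix SRT schemes follow the same add-or-subtract pattern but select from a redundant digit set, so that several quotient bits are produced per cycle.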

4.5  Precise  Exceptions 

Being precise, with regard to exceptions, means that the machine state at the time
of the exception is the same as for a sequential CPU model; all instructions issue and
complete in order. A formal definition might be [Iacobovici88], [Smith88]:

1)  All  instructions  prior  to  the  interrupting  one  have  completed. 

2)  All  subsequent  instructions  are  unexecuted  and  have  not  modified  state.  For  more 
ambitious  architectures,  this  may  result  in  a  need  to  restore  the  program  counter,  sta¬ 
tus  registers,  and  RF  operands. 

3)  If  an  instruction  causes  an  exception,  the  program  counter  points  to  the  address  of 
this  instruction  for  use  by  the  exception  handler. 

There  are  two  classes  of  exceptions,  the  first  relating  to  floating-point  computation 
and  the  second  concerning  memory  faults;  since  they  are  constrained  by  different  issues 
they  will  be  discussed  separately. 

4.5.1  Floating-point  Computation  Exceptions 

There  are  two  issues  related  to  precise  exceptions:  how  much  should  the  IPU  slip 
ahead  of  the  FPU  and  how  much  can  be  done  in  parallel  within  the  FPU.  The  latter  is  han¬ 
dled  effectively  by  a  reorder  buffer.  However,  with  the  use  of  an  instruction  queue,  the 
former  can  be  quite  costly,  since  the  IPU  may  have  slipped  far  ahead  of  the  FPU.  To  main¬ 
tain  precise  exceptions,  it  would  be  necessary  to  back-up  the  state  of  the  IPU  to  the  instruc¬ 
tion  after  the  exceptional  one,  at  the  expense  of  much  storage  logic  and  a  possible  increase 
in  critical  path  lengths.  The  simplest  approach  for  synchronizing  the  IPU  and  FPU  is  to 
have  floating-point  operations  always  stall  the  integer  pipeline.  A  better  variant  of  this  op¬ 
tion  involves  exception  prediction,  where  parts  of  the  operands  are  compared  to  certain 
constants  to  determine  if  an  exception  is  possible.  As  an  example,  if  the  biased  exponent 
field  of  both  single  precision  operands  is  less  than  192,  an  overflow  will  not  occur.  A  re¬ 
quirement  to  be  precise  will  always  impact  performance  and  a  conservative  prediction  pol¬ 
icy  costs  more  in  performance. 
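
Exception prediction of this kind can be sketched in a few lines. The sketch assumes the operation is an add and uses the threshold of 192 quoted above; the function names are hypothetical, and other operations would need their own constants.

```python
import struct


def biased_exponent(x):
    """Biased 8-bit exponent field of x encoded as IEEE 754 single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return (word >> 23) & 0xFF


def add_may_overflow(a, b, threshold=192):
    """Conservative check: True means the add must be treated as possibly excepting."""
    return biased_exponent(a) >= threshold or biased_exponent(b) >= threshold
```

Only when the check returns True would the integer pipeline need to stall; a lower threshold makes the prediction more conservative and therefore stalls more often.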

There  are  several  implications  in  not  supporting  precise  floating-point  computation 
exceptions:  1)  a  greater  burden  is  placed  upon  the  software,  and  2)  it  is  more  difficult  to 
restart  a  program  after  an  exception.  The  latter  concern  would  be  a  problem  for  real-time 
systems which must always be able to produce a correct result and cannot allow the
nation  of  a  program,  such  as  the  control  system  for  an  airplane.  However,  the  environment 
of  a  workstation  is  somewhat  more  tolerant  and  the  primary  concern  here  is  that  a  program 
will  abort  after  running  for  many  hours.  Several  techniques  can  be  used  to  reduce  the  neg¬ 
ative  effects  of  imprecise  exceptions  while  still  taking  advantage  of  the  potential  perfor¬ 
mance  gains  seen  by  not  being  precise.  First,  the  additional  range  that  an  extended- 
precision  mode  offers  might  reduce  exceptions  for  sections  of  code  that  are  required  to  be 
reliable.  However,  support  for  these  less  frequently  utilized  extended  formats  may  both  in¬ 
crease  and  complicate  the  resources  devoted  to  floating-point  functionality.  The  approach 
used in the Aurora III FPU implements a separate precise mode of operation, where setting
a  bit  in  the  IPU  and  FPU  control  registers  ensures  that  a  precise  execution  model  is  fol¬ 
lowed.  After  one  or  two  instructions  are  transferred  to  the  floating-point  instruction  queue, 
the  FPU  asserts  the  queue-full  condition,  relaxing  it  only  when  each  transferred  instruction 
has  completed  without  generating  an  exception.  The  IPU  will  not  proceed  until  the  FPU  has 
determined that the instructions are exception-free. In addition to this method, checkpointing
of some type might be used to periodically save the state of the processor, so as to allow
restarting.  If  an  exception  were  encountered,  the  precise  mode  would  be  enabled  and  the 
program  would  be  restarted  from  the  last  checkpoint.  A  trap  handler  would  then  be  called 
when  the  exceptional  instruction  is  reached,  and  a  result  would  be  returned  by  software  rou¬ 
tines. 

The  great  majority  of  stable,  well-tested  applications  do  not  experience  floating¬ 
point  exceptions;  for  example,  there  are  no  computation  exceptions  found  in  the  tens  of  mil¬ 
lions  of  cycles  for  the  SPECfp92  benchmarks.  If  a  program  shows  a  tendency  toward  en¬ 
countering  exceptions,  the  code  may  need  to  be  rewritten  to  detect  exceptions  before  they 
occur.  In  a  sense,  implementing  precise  exceptions  is  not  consistent  with  the  RISC  philos¬ 
ophy  of  emphasizing  speed  and  justifying  hardware  expenditure  based  on  frequency  of  use. 
For  instance,  page-faults  are  a  common  occurrence  and  require  precise  handling;  floating¬ 
point  exceptions  are  quite  rare  and  so  one  should  not  spend  significant  complexity  or  re¬ 
sources  implementing  them.  It  is  interesting  to  note  that  several  commercial  machines  have 
been unintentionally imprecise, including the IBM 360/91, Cray, and CDC 6600/7600, and
that  a  growing  number  of  current  machines  offer  a  high  performance  mode  which  does  not 
support  precise  floating-point  exceptions,  including  processors  such  as  the  RS/6000  from 
IBM  and  the  R8000/TFP  from  MIPS/SGI. 

4.5.2  Memory  Exceptions 

Memory  exceptions  are  a  different  class  from  floating-point  computation  excep¬ 
tions,  in  that  a  computer  must  be  precise  with  respect  to  page  faults.  Whereas  floating-point 
exceptions  occur  infrequently,  memory  exceptions  are  a  part  of  the  normal  operation  of  a 
program  and  must  preserve  the  in-order  sequence  of  instructions.  In  part,  this  means  that 
floating-point  loads  and  stores  must  reserve  an  integer  reorder  buffer  entry,  as  do  integer 
loads  and  stores.  If  a  page  fault  occurs  for  a  floating-point  load  or  store,  the  exception  field 
in  the  integer  reorder  buffer  entry  for  the  corresponding  instruction  will  be  marked.  When 
this  instruction  reaches  the  head  of  the  reorder  buffer,  an  exception  is  signalled  and  pro¬ 
cessed.  All  subsequent  instructions  are  flushed  from  the  reorder  buffer.  The  state  of  the 
FPU  upon  detection  of  a  memory  exception  must  be  consistent  with  that  of  the  IPU;  in  other 
words,  no  instructions  after  the  exceptional  one  can  be  allowed  to  change  the  state  of  the 
FPU  (write  the  register  file  or  control  register).  In  order  to  implement  this,  all  load  and  store 
instructions,  regardless  of  whether  they  are  integer  or  floating-point,  must  be  sent  to  the 
FPU.  In  addition,  the  tag  corresponding  to  the  integer  reorder  buffer  entry  must  accompany 
the  instruction.  When  the  integer  load/store  instruction  reaches  the  head  of  the  instruction 
queue  it  will  issue  within  the  FPU  and  will  be  allocated  a  floating-point  reorder  buffer  en¬ 
try.  When  it  is  known  that  a  load/store  will  not  cause  a  page  fault,  both  the  integer  reorder 
buffer  tag  and  a  valid  signal  will  be  sent  to  the  FPU.  Using  a  small  lookup  memory,  the 
correct  floating-point  reorder  buffer  entry  will  be  marked  as  valid.  This  approach  requires 
two  exception-signalling  pins  between  the  IPU  and  FPU,  one  for  when  the  IPU  recognizes 
a  memory  fault  and  one  for  when  the  FPU  recognizes  a  floating-point  numerical  exception. 
In addition to the exception signal, the IPU must send to the FPU the tag of the integer
reorder buffer entry that caused the exception, since this same instruction may not yet have
reached the head of the floating-point reorder buffer. To put this another way, the FPU may
not have completed executing all instructions that occurred prior to the exceptional one.
The FPU must finish these outstanding instructions before any additional instructions are
transferred. After catching up to the IPU, the FPU must perform the following:

1.  all  scoreboard  valid  bits  are  cleared  since  all  instructions  after  the  exceptional  one 
will  be  discarded, 

2.  all  reorder  buffer  valid  bits  are  cleared, 

3. all result bus valid bits are cleared, in order to ensure that a reorder buffer valid
bit  is  not  subsequently  and  erroneously  set, 

4.  all  head  and  tail  pointers  are  reset,  including  those  for  the  reorder  buffer,  instruction, 
load,  and  store  queues. 
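
The four steps above can be sketched as a single flush routine. The structure sizes and all names below are assumptions chosen for illustration; they are not the dimensions of the actual FPU.

```python
class FpuState:
    """Minimal model of the FPU bookkeeping state cleared after a memory exception."""

    QUEUES = ("reorder_buffer", "instruction_queue", "load_queue", "store_queue")

    def __init__(self, registers=32, rob_entries=8, result_buses=2):
        self.scoreboard_valid = [False] * registers
        self.rob_valid = [False] * rob_entries
        self.result_bus_valid = [False] * result_buses
        self.pointers = {name: {"head": 0, "tail": 0} for name in self.QUEUES}

    def flush_after_memory_exception(self):
        """Apply steps 1 to 4 once the FPU has caught up to the IPU."""
        self.scoreboard_valid = [False] * len(self.scoreboard_valid)   # step 1
        self.rob_valid = [False] * len(self.rob_valid)                 # step 2
        self.result_bus_valid = [False] * len(self.result_bus_valid)   # step 3
        for pointer in self.pointers.values():                         # step 4
            pointer["head"] = 0
            pointer["tail"] = 0
```

Clearing the result bus valid bits in the same routine reflects the concern in step 3: a result still in flight must not set a reorder buffer valid bit after the buffer has been emptied.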

The  FPU  ensures  that  no  additional  instructions  are  transferred  by  simply  asserting 
the  queue  full  condition  until  the  internal  state  of  the  FPU  has  caught  up  to  the  IPU.  In  the 
meantime,  the  IPU  is  free  to  begin  executing  integer  instructions  from  the  memory  fault 
trap  handler.  As  discussed  earlier,  all  compare  and  store  instructions  must  pass  through  the 
floating-point reorder buffer prior to writing state, in order to ensure a precise model of
execution for memory faults. Finally, the IPU (and LSU) must continue to retire any valid
store  queue  entries  since  these  will  correspond  to  instructions  that  occurred  prior  to  the  ex¬ 
ceptional  one. 

The  MIPS  ISA  also  requires  that  reads  and  writes  of  the  floating-point  status  regis¬ 
ter  must  stall  until  all  issued  instructions  have  completed  and  have  written  back  to  the  reg¬ 
ister  file.  The  former  constraint  is  necessary  since  there  is  no  additional  hardware  to  restore 
the  contents  of  the  status  register  if  an  exception  occurs  and  a  subsequent  instruction  has 
changed  the  register.  Access  of  the  status  register  is  rare  (2.2%  of  dynamic  instructions,  on 
average  across  the  SPECfp92  benchmarks)  and  does  not  justify  the  cost  of  additional  re¬ 
covery  logic.  Passing  the  data  to  be  written  to  the  status  register  through  the  reorder  buffer 
in  order  to  maintain  a  precise  execution  model  is  difficult,  since  several  fields  in  the  status 
register  (rounding  modes,  exception  enables)  are  fed  directly  to  the  various  functional 
units. Waiting for the status register data to reach the head of the reorder buffer before
committing the write would mean that subsequent instructions might not be executed with the
correct modes. Bypassing these fields directly from each reorder buffer entry would be
costly.  The  constraint  of  stalling  the  issue  of  a  status  register  instruction  until  all  outstand¬ 
ing  instructions  have  written  the  register  file  is  necessary  since  a  result  in  the  reorder  buffer 
may  cause  an  exception,  but  will  not  be  detected  until  it  reaches  the  head  of  the  reorder 
buffer. 


4.6  Implementing  Floating-point  Loads 

The  determination  of  whether  load  data  is  valid  occurs  some  number  of  cycles  after 
a  floating-point  load  instruction  has  been  transferred  to  the  FPU.  Data  for  a  load  miss  will 
actually  be  sent  twice  to  the  FPU;  the  first  time  occurs  prior  to  knowing  whether  a  data 
cache  hit  has  occurred  and  the  data  is  sent  directly  from  the  cache  via  the  dcOut  bus.  Later, 
when  the  miss  data  is  either  received  from  the  BIU/MMU  or  is  retrieved  from  the  write 
cache  it  will  be  sent  to  the  FPU  using  the  dcin  bus.  These  different  events  require  a  means 
of  applying  the  validation  signal  to  the  correct  load  queue  entry.  One  approach  would  re¬ 
quire  the  load-store  unit  to  maintain  information  about  which  load  queue  entry  an  outstand¬ 
ing  load  instruction  has  been  allocated.  This  tag  would  need  to  be  sent  along  with  the 
validation  signal  in  order  to  ensure  that  the  correct  load  queue  entry  is  marked  as  valid.  The 
load-store  unit  can  store  the  address  of  the  tail  entry  of  the  load  queue  (a  new  floating-point 
load  instruction  always  allocates  the  tail  entry  of  the  load  queue),  since  it  is  the  IPU  that 
initiates  pushes  and  pops  to  the  load  queue.  In  other  words,  the  load-store  unit  can  have  a 
duplicate,  independent  copy  of  the  Lq  tail  pointer,  and  at  any  given  time,  the  FPU  and  load- 
store  unit  copies  should  be  the  same.  For  a  load  hit  in  the  data  cache,  this  load  queue  tag 
will  stay  in  the  load-store  unit  pipeline  until  the  tag  comparison  has  been  made,  at  which 
time  the  tag  will  be  sent  to  the  FPU  along  with  the  validation  signal.  For  a  load  miss,  the 
load  queue  tag  will  be  written  to  the  miss-status-holding  register  (mshr)  that  corresponds  to 
the  floating-point  load  instruction.  When  the  load  data  is  returned  from  the  BIU/MMU,  this 
tag  will  be  read  from  the  mshr  and  sent  with  the  data  and  validation  signal  to  the  FPU. 

A preferred option for guiding load data and validation signals to the correct load
queue entry will be presented here. Similar to the approach for precise memory exceptions
described in Section 4.5.2, when a floating-point load instruction is transferred to the FPU,
the corresponding integer reorder buffer tag is also sent. The entry number in the load queue
for the floating-point load being issued is written into another small tag memory (LqTagMem).
When data from a load miss arrives from the BIU, or when data previously written
to the FPU is known to have hit in the data cache, the integer reorder buffer tag for this
memory reference is resent to the FPU along with data and validation signal. The tag
memory is read to obtain the correct entry number in the load queue. This tag memory must have
as many entries as there are integer reorder buffer entries. This approach was selected for
the Aurora III design because it requires minimal additions to the existing load-store unit
functionality. 
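
The role of LqTagMem can be sketched as a small lookup table indexed by integer reorder buffer tag. The class name, method names, and sizes below are hypothetical and serve only to illustrate the indirection.

```python
class LoadQueueModel:
    """Load queue plus a tag memory indexed by integer reorder buffer tag."""

    def __init__(self, integer_rob_entries, load_queue_entries):
        self.lq_tag_mem = [None] * integer_rob_entries   # one slot per integer ROB entry
        self.data = [None] * load_queue_entries
        self.valid = [False] * load_queue_entries
        self.tail = 0

    def transfer_load(self, rob_tag):
        """A new floating-point load allocates the tail entry of the load queue."""
        entry = self.tail
        self.lq_tag_mem[rob_tag] = entry
        self.valid[entry] = False
        self.tail = (self.tail + 1) % len(self.data)
        return entry

    def deliver(self, rob_tag, data=None, valid=False):
        """Steer data and/or the validation signal using the resent ROB tag."""
        entry = self.lq_tag_mem[rob_tag]
        if data is not None:
            self.data[entry] = data
        if valid:
            self.valid[entry] = True
        return entry
```

With this indirection the load-store unit only needs to resend a tag it already holds, which is consistent with the stated goal of minimal additions to its existing functionality.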

In  a  number  of  cases,  the  IPU  needs  to  send  an  integer  reorder  buffer  tag  to  the  FPU. 
These  are  when: 

a) writing LqTagMem when a floating-point load instruction is transferred to the FPU,

b)  writing  the  load  queue  with  data  from  the  data  cache  (occurs  2  cycles  after  the 
instruction  was  transferred), 

c)  writing  a  load  queue  valid  bit  for  a  data  cache  hit  (happens  in  the  second  cycle  after 
the  tag  was  returned), 

d)  writing  the  load  queue  with  data  and  the  valid  bit  for  a  write  cache  hit  or  a  load  miss 
that  has  been  returned  via  the  BIU. 

These  lead  to  the  following  constraints: 

1.  a  and  d  cannot  happen  simultaneously,  since  load  data  sent  to  the  FPU  via  dcin  is 
prioritized  higher  than  the  transfer  of  floating-point  instructions. 

2.  (a  or  d),  b,  and  c  can  occur  simultaneously,  which  calls  for  a  small  integer  reorder 
buffer  tag  pipeline  in  the  FPU,  to  track  the  progress  of  load  data  through  the  data 
cache;  this  approach  requires  only  one  external  tag  bus  between  the  IPU  and  FPU. 


3.  a  requires  write  access  to  LqTagMem. 

4. b, c, d all need simultaneous read access to LqTagMem. This will require 3 read ports
for LqTagMem.

5.  b  and  either  a  (for  move-to-FPU  instructions)  or  d  need  simultaneous  write  access 
for load queue data. Consequently, the load queue will need to have 2 write ports.

6.  c  and  either  a  (for  move-to-FPU  instructions)  or  d  need  simultaneous  write  access  for 
load queue valid bits. Two valid-bit write ports will be needed for the load queue.

4.7  Implementing  Floating-point  Stores 

Store  data  will  be  written  into  the  store  queue  in  the  same  order  it  occurs  in  the  pro¬ 
gram;  in  other  words,  there  is  no  need  to  reorder  the  stores  as  they  are  extracted  from  the 
store  queue  by  the  load-store  unit.  In  addition  to  the  result  data,  the  store  queue  will  also 
contain  a  store-type  designator  and  integer  reorder  buffer  tag  field  (both  of  which  will  need 
to be dedicated pins). There are two distinct classes of store instructions, those that send
data to memory (swc1/sdc1) and those that transfer data to the integer register file (mfc1/cfc1).
Only the swc1/sdc1 instructions will write data to the write cache. An alternative to
sending  the  store  type  and  destination  field  from  the  FPU  to  the  IPU  would  be  to  have  a 
small  queue  in  the  load-store  unit.  This  queue  would  be  written  when  the  floating-point 
store  instruction  is  transferred  to  the  FPU.  However,  this  queue  would  need  to  have  as  many 
entries  as  the  possible  number  of  outstanding  floating-point  stores  (the  number  in  the  in¬ 
struction  queue  plus  the  number  in  the  store  queue  plus  one  for  the  store  unit  pipe  stage). 
The designer is faced with a choice between a few additional pins and the addition of a
fairly large memory structure. To minimize chip area, because of the limited integration
levels of GaAs, I chose the former approach for the Aurora III FPU.

The  main  complexity  in  implementing  floating-point  store  instructions  concerns  the 
fact  that  the  data  is  not  ready  when  the  write  cache  entry  is  allocated  (or  overwritten  for  a 
store  write  cache  hit).  For  an  integer  store,  the  procedure  is  to  allocate  a  new  write  cache 
entry  if  the  store  misses  in  the  write  cache.  The  address  and  data  are  written  at  this  point 
while the rest of the line is written when the data is available from the data cache or
secondary memory. Since the floating-point data is not ready at the time the write cache entry
is allocated, either the data or the rest of the line may appear first. In addition, more than
one floating-point store may hit in the same write cache entry (and word within that entry).
It is important to not mark the entire line as valid if one or more floating-point stores are
still outstanding. The load-store unit also needs to know to which write cache entry the
floating-point store data is to be written. In the discussion that follows, it is assumed that each
write  cache  entry  has  separate  valid  bits  for  each  word  of  the  line,  and  there  are  8  words  per 
entry.  An  “fpbusy”  valid  bit  is  also  needed  for  each  word.  The  sequence  for  handling  float¬ 
ing-point  stores  is  as  follows: 

1.  The  store  instruction  and  address  are  forwarded  from  the  ALU  stage  of  the  IPU  to  the 
first  stage  of  the  load-store  unit.  If  the  address  hits  in  the  write  cache  but  the  fpbusy 
flag  is  set  for  this  word,  the  instruction  is  set  aside  (this  will  be  described  below  in 
more  detail).  Otherwise,  the  floating-point  store  instruction  is  transferred  to  the  FPU. 

2a.  On  a  write  cache  miss,  a  write  cache  entry  is  allocated  and  the  address  is  written  to 
the  entry.  A  lazy  writeback  policy  ensures  that  there  is  a  free  entry;  an  instruction 
will  not  be  forwarded  to  the  load-store  unit  if  all  write  cache  entries  have  words  out¬ 
standing.  The  word  valid  bit  is  cleared  (unless  this  is  an  integer  store,  in  which  case 
the word valid would be set) and the fpbusy bit is set.

b. On a write cache hit, the fpbusy bit is set and the word valid bit is cleared. If the
fpbusy bit was set, another floating-point store to the same word in this line is
outstanding.

3a.  For  a  write  cache  miss,  the  line  will  either  be  in  the  data  cache  or  will  need  to  be 
fetched  via  the  BIU/MMU.  In  either  case,  when  the  line  is  returned  each  word  valid 
bit  is  set,  unless  the  fpbusy  bit  is  set.  The  data  for  a  word  of  the  entry  should  not  be 
written  if  the  valid  bit  is  already  set  (this  happens  if  a  previous  instruction  was  an 
integer  store,  or  if  the  FPU  has  returned  the  outstanding  floating-point  store  data 
prior  to  the  line  arriving). 

b. At some point, either before or after 3a, the store queue will be loaded with the
required word. A separate dedicated tag bus from the FPU to the IPU will contain
the integer reorder buffer tag for the floating-point store at the head of the store queue
(this tag was sent initially along with the instruction to the FPU). This tag from the
FPU will then be used to retrieve the address from the integer reorder buffer entry,
which  in  turn  will  be  sent  to  the  data  cache.  At  the  same  time,  the  data  will  be  sent 
to  the  data  cache  from  the  FPU  via  dcin,  and  will  also  be  sent  to  the  IPU  via  dcOut 
so  it  can  be  written  into  the  write  cache.  If  there  are  no  other  outstanding  stores  to 
the  same  word  of  this  line  (the  determination  of  this  will  be  discussed  below),  the 
fpbusy  flag  is  cleared  and  the  word  valid  flag  is  set.  Only  if  all  word  valid  flags  are 
set  can  this  write  cache  entry  be  written  back  to  the  data  cache  and  the  secondary 
memory  system.  Note  that  it  is  possible  for  the  write  cache  to  fill  up  without  being 
able  to  write  back  an  entry;  this  would  occur  when  all  entries  have  outstanding  float¬ 
ing-point  stores.  When  the  write  cache  is  full,  the  load-store  unit  must  not  accept  any 
more  store  instructions  until  an  entry  becomes  free. 
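
The interaction of the word valid and fpbusy bits in the sequence above can be sketched for a single write cache entry. This is a simplified model under stated assumptions: one entry of 8 words, one fpbusy bit per word, and hypothetical method names; address matching and the tag bus are omitted.

```python
class WriteCacheEntry:
    """One write cache line with per-word valid and fpbusy bits."""

    WORDS = 8

    def __init__(self):
        self.data = [None] * self.WORDS
        self.word_valid = [False] * self.WORDS
        self.fpbusy = [False] * self.WORDS

    def fp_store_issued(self, word):
        """Steps 1 and 2: returns False if the store must be set aside."""
        if self.fpbusy[word]:
            return False
        self.word_valid[word] = False
        self.fpbusy[word] = True
        return True

    def line_returned(self, line):
        """Step 3a: fill only words that are neither valid nor awaiting FPU data."""
        for word, value in enumerate(line):
            if self.word_valid[word] or self.fpbusy[word]:
                continue
            self.data[word] = value
            self.word_valid[word] = True

    def fp_store_data(self, word, value):
        """Step 3b: data arrives from the store queue for the outstanding store."""
        self.data[word] = value
        self.fpbusy[word] = False
        self.word_valid[word] = True

    def can_write_back(self):
        """The entry may be written back only when every word is valid."""
        return all(self.word_valid)
```

The model gives the same final contents whether the line or the floating-point data arrives first, which is the property the two-bit scheme is meant to guarantee.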

A  potential  problem  arises  when  a  given  entry  in  the  write  cache  has  more  than  one 
outstanding  floating-point  store  to  the  same  word.  The  processor  must  be  able  to  determine 
whether an entry in the store queue is the last one to reference that word in the write cache.
A  solution  to  the  problem  might  involve  assigning  an  “fpbusy”  tag  to  each  outstanding 
store;  this  tag  is  similar  in  nature  to  those  used  in  the  reorder  buffer  to  write-back  data  to 
the  register  file.  In  the  reorder  buffer  write-back  case,  there  can  be  multiple  entries  in  the 
reorder  buffer,  all  of  which  write  the  same  register;  the  scoreboard  must  not  be  cleared  until 
the  most  recent  register  reference  reaches  the  head  of  the  reorder  buffer.  This  is  accom¬ 
plished  by  storing  the  reorder  buffer  tag  for  the  most  recent  instruction  in  the  scoreboard 
entry  for  the  destination  register.  When  an  entry  reaches  the  head  of  the  reorder  buffer,  the 
scoreboard  is  cleared  only  if  the  number  of  that  entry  matches  the  scoreboard  tag  for  the 
destination register. A similar approach could be applied to the floating-point store problem,
which is in reality a reordering problem. The size of the tag would need to accommodate
the maximum number of stores that can be outstanding at a time. When the write cache
entry is allocated, or a word in an entry is overwritten during a write cache hit, this tag
would be written to a field in the entry for that word. When the data was returned from the
store queue, the fpbusy tag in the entry would be compared to the fpbusy tag sent along with
the  data  from  the  FPU.  If  the  two  were  the  same,  the  fpbusy  valid  bit  for  that  word  of  the 
write  cache  entry  would  be  cleared  and  the  word  valid  bit  would  be  set. 
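
The tag comparison can be illustrated with a minimal C sketch. The names (wc_word_t, wc_store_issue, wc_store_return) and the field layout are hypothetical, and the decision to drop data whose tag does not match is an assumption; the text specifies only what happens when the two tags agree.

```c
#include <assert.h>
#include <stdint.h>

/* One word of a write cache entry (hypothetical layout). */
typedef struct {
    uint8_t  fpbusy;   /* a floating-point store to this word is outstanding */
    uint8_t  valid;    /* word holds data that may be written back */
    uint8_t  tag;      /* fpbusy tag of the most recent outstanding store */
    uint64_t data;
} wc_word_t;

/* A floating-point store allocates or overwrites the word: record its tag. */
static void wc_store_issue(wc_word_t *w, uint8_t tag) {
    w->fpbusy = 1;
    w->valid = 0;
    w->tag = tag;
}

/* Data arrives from the store queue together with its tag. Returns 1 if this
 * was the most recent store to the word, 0 if a newer store is outstanding. */
static int wc_store_return(wc_word_t *w, uint8_t tag, uint64_t data) {
    assert(w->fpbusy);
    if (w->tag != tag)
        return 0;          /* assumption: superseded data is simply dropped */
    w->data = data;
    w->fpbusy = 0;
    w->valid = 1;
    return 1;
}
```

With two stores outstanding to the same word, only the return of the second one clears fpbusy and sets the word valid flag.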

As  an  alternative,  the  fpbusy  tag  could  be  replaced  with  an  fpbusy  count  field  for 
each  word  in  a  write  cache  entry.  This  count  field  would  be  incremented  or  decremented 
each  time  an  outstanding  floating-point  store  for  this  word  is  issued  or  retired,  respectively. 
The  count  field  needs  to  be  large  enough  to  accommodate  the  maximum  number  of  possible 
outstanding  floating-point  stores.  However,  if  the  count  field  uses  a  Johnson  encoding,  the 
incrementing/decrementing  could  be  done  without  an  adder,  at  the  expense  of  needing 
more  bits  of  storage  for  each  count  field.  The  fpbusy  valid  bit  would  be  cleared  and  the 
word  valid  bit  set  only  if  the  fpbusy  count  is  zero  for  a  store  retired  from  the  store  queue. 
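
To make the Johnson-encoding remark concrete, the following C sketch models a 4-bit Johnson (twisted-ring) counter, which steps through 8 states using only a shift and one inverted feedback bit. The width JBITS and the function names are assumptions chosen for illustration, not values taken from the design.

```c
#include <assert.h>
#include <stdint.h>

#define JBITS 4u                       /* assumed width: 2 * JBITS = 8 states */
#define JMASK ((1u << JBITS) - 1u)

/* Increment: shift left and feed the inverted MSB into the LSB.
 * Sequence from zero: 0000 0001 0011 0111 1111 1110 1100 1000 0000 ...
 * The caller must not exceed 2 * JBITS - 1 outstanding stores, since the
 * code wraps back to zero after that. */
static uint8_t johnson_inc(uint8_t s) {
    uint8_t msb = (uint8_t)((s >> (JBITS - 1u)) & 1u);
    return (uint8_t)(((unsigned)(s << 1) | (msb ^ 1u)) & JMASK);
}

/* Decrement: shift right and feed the inverted LSB into the MSB. */
static uint8_t johnson_dec(uint8_t s) {
    assert(s != 0);                    /* nothing outstanding to retire */
    uint8_t lsb = (uint8_t)(s & 1u);
    return (uint8_t)((s >> 1) | ((lsb ^ 1u) << (JBITS - 1u)));
}

/* The all-zero code means no floating-point store is outstanding. */
static int johnson_is_zero(uint8_t s) {
    return s == 0;
}
```

The cost noted in the text is visible here: 4 bits encode only 8 counts, where a binary field of the same width would encode 16.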

Both  of  these  approaches  add  significant  complexity  and  resources  for  a  condition 
which  will  seldom  be  encountered.  A  final  option  would  also  involve  setting  an  fpbusy  bit 
whenever  a  floating-point  store  allocates  a  write-cache  entry  or  writes  to  an  existing  entry. 
However,  if  another  store  to  the  same  word  arrives  at  the  load-store  unit  while  the  first  one 
is still outstanding, the corresponding instruction and address would be placed into a single-entry
holding register. It is necessary to set aside the instruction in this way, since upon
completion  of  the  first  floating-point  store,  the  address  of  this  first  instruction  must  be  re¬ 
sent  to  the  load-store  unit  from  the  integer  reorder  buffer  as  the  data  is  sent  to  the  IPU  via 
the  dcOut  bus.  This  action  needs  access  to  the  first  pipe  stage  of  the  load-store  unit  in  order 
to  rederive  the  address  of  the  write  cache  entry.  In  addition  to  setting  aside  the  second  store 
instruction,  the  load-store  unit  will  signal  the  ALU  stage  to  not  forward  any  additional  in¬ 
structions  until  the  second  store  has  been  activated,  after  the  first  store  has  completed  and 
written  the  write  cache  entry.  This  approach  is  preferable  to  the  first  two  for  a  number  of 
reasons. First, there is no longer a need for additional state in the form of an fpbusy tag or
count fields for each word of a write-cache entry. Second, the case of having two or more
outstanding stores write the same word should occur infrequently; therefore, a solution
should not require a significant increase in either complexity or chip area. A load instruction
which hits a word in the write cache while a floating-point store is outstanding to this
word must also follow this set-aside procedure. Such a load cannot progress further since
the data it needs is not yet available. The mechanism of setting aside load and store instructions
which reference the same word of a write-cache entry ensures that memory references
function reliably in the Aurora III processor.

4.8  Predecoding  Floating-point  Instructions 

Critical  paths  in  the  FPU  design  are  well  balanced  and  no  section  of  the  design  ex¬ 
ceeds  the  goal  of  20  gates  per  clock  phase.  The  issue  logic,  in  particular,  is  fairly  complex, 
having  its  operation  constrained  by:  data  dependencies  via  the  scoreboard  and  reorder  buff¬ 
er,  invalid  load  queue  entries,  busy  multiply  and  divide  units,  unavailable  result  busses,  an 
empty  instruction  queue,  a  full  reorder  buffer,  a  full  store  queue,  and  status  register  access¬ 
es.  After  issue  has  been  resolved,  a  host  of  other  actions  may  need  to  be  performed,  includ¬ 
ing:  setting  the  scoreboard  valid  and  tag  fields,  reserving  a  result  bus  entry,  initiating  a 
multiply  or  divide  operation,  reserving  a  reorder  buffer  entry,  selecting  the  source  for  op¬ 
erands,  and  advancing  to  the  next  instruction  queue  entry.  The  combination  of  decoding 
and  evaluating  the  two  instructions  at  the  head  of  the  queue  could  easily  exceed  the  target 
gate  path  depth.  Since  several  pipe  stages  are  required  for  floating-point  instructions  to 
reach  the  queue,  there  is  ample  opportunity  to  derive  a  set  of  predecoded  signals  which  can 
simplify  the  logic  needed  by  the  issue  clock  phase.  If  the  predecode  logic  were  placed  in 
the  IPU,  additional  pins  would  be  required  to  transfer  the  instructions.  Instead,  instructions 
are  predecoded  in  the  phase  immediately  before  they  are  written  into  the  queue.  This  time 
slot occurs naturally in the Aurora III design; if the predecoding were performed earlier,
this  phase  would  be  wasted.  The  use  of  predecoding  does  require  an  additional  15  bits  per 
entry  in  the  instruction  queue.  The  predecoded  signals  can  be  summarized  as  follows: 

1.  Iclass:  a  3-bit  tag  that  indicates  which  functional  unit  will  be  utilized  by  this  instruc¬ 
tion. 

2. Idepclass: a 2-bit tag that refers to how many source operands are used by the instruction
(0, 1, or 2).

3. Iresclass: a 1-bit tag that indicates whether the instruction produces a result that will
need  to  pass  through  the  reorder  buffer. 

4.  FUspec:  a  3-bit  tag,  written  to  a  result  bus  shift  register,  identifies  which  functional 
unit  will  write  the  reorder  buffer  upon  completion  of  the  instruction. 

5.  FUvector:  a  5-bit  valid  vector,  written  to  a  result  bus  shift  register,  aligns  data  prop¬ 
erly  to  be  written  into  the  reorder  buffer. 

6.  IclassExcept:  a  1-bit  signal  that  identifies  certain  types  of  exceptions,  such  as  an 
invalid  opcode  or  an  operand  format  that  is  illegal  for  a  given  instruction. 
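
The six fields above account for the 15 additional bits per queue entry (3 + 2 + 1 + 3 + 5 + 1). As a rough sketch, they could be packed into a single word as follows; the bit ordering, the type name predecode_t, and the function names are assumptions made for illustration and are not taken from the Aurora III implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Predecoded signals for one instruction queue entry (hypothetical layout). */
typedef struct {
    unsigned iclass;         /* 3 bits: functional unit used */
    unsigned idepclass;      /* 2 bits: number of source operands (0, 1, 2) */
    unsigned iresclass;      /* 1 bit:  result passes through reorder buffer */
    unsigned fuspec;         /* 3 bits: unit that writes the reorder buffer */
    unsigned fuvector;       /* 5 bits: valid vector for the result bus */
    unsigned iclass_except;  /* 1 bit:  invalid opcode or operand format */
} predecode_t;

/* Pack the fields into the low 15 bits of a word (assumed bit order). */
static uint16_t predecode_pack(predecode_t p) {
    assert(p.iclass < 8 && p.idepclass < 3 && p.iresclass < 2);
    assert(p.fuspec < 8 && p.fuvector < 32 && p.iclass_except < 2);
    return (uint16_t)(p.iclass | (p.idepclass << 3) | (p.iresclass << 5) |
                      (p.fuspec << 6) | (p.fuvector << 9) |
                      (p.iclass_except << 14));
}

/* Recover the fields from a packed word. */
static predecode_t predecode_unpack(uint16_t w) {
    predecode_t p;
    p.iclass        = w & 0x7u;
    p.idepclass     = (w >> 3) & 0x3u;
    p.iresclass     = (w >> 5) & 0x1u;
    p.fuspec        = (w >> 6) & 0x7u;
    p.fuvector      = (w >> 9) & 0x1Fu;
    p.iclass_except = (w >> 14) & 0x1u;
    return p;
}
```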

The predecode logic is duplicated in order to handle the 2 floating-point instructions
that can be transferred to the FPU per cycle. Table 4.8 summarizes characteristics of this
logic and shows that predecoding removes as many as 12 gate levels from the issue phase.

Table 4.8 Predecode Logic Statistics

Signal          Synthesized Logic Depth
Iclass          10
Idepclass       8
Iresclass       10
FUspec          10
FUvector        12
IclassExcept    12

4.9 Design-For-Test Features

Design-for-test is often added near the end of a design, and designers are usually
concerned about how much area and design time will be needed. Full scan is impractical
for an integration-poor technology like GaAs DCFL. Every design-for-test feature added
must be thoroughly tested, both to ensure it will do what is intended and to ensure that it
does not introduce errors into a stable design. With these constraints in mind, I included the
following features in the FPU:

1.  Two  scan  chains  for  the  register  file.  The  register  file  is  sense-amplifier  based;  the 
yield associated with this analog design and dense memory structures in GaAs is a
concern.  The  two  scan  chains  provide  accessibility  to  the  register  file,  one  for  the  5 
address  ports  (1  write,  4  read)  and  one  for  the  input  and  output  data.  In  addition,  spe¬ 
cial  write  enable  and  clock  signals  have  been  added  to  allow  the  normal  write  logic 
to  be  bypassed  when  the  FPU  is  being  tested. 

2.  A  scan  chain  for  the  23  main  issue-logic  signals.  The  observability  gained  with  this 
scan chain should allow one to identify a problem in some component of the FPU that prevents
issue from occurring. These signals are not controllable, since they originate
from  parts  of  the  design  that  are  difficult  to  access  without  adding  design  complexity, 
such  as  the  valid  bits  for  the  reorder  buffer  and  scoreboard. 

3.  A  scan  chain  for  one  of  the  two  result  busses,  allowing  the  result  of  any  functional 
unit  to  be  verified  prior  to  writing  the  reorder  buffer  and  the  register  file. 

4.  A  scan  chain  for  the  top  2  entries  in  the  instruction  queue,  including  all  predecode 
signals. 

5.  A  test  signal,  testStallFPU,  which  ensures  that  the  issue  of  instructions  will  stall  until 
the  instruction  and  load  data  queues  have  been  loaded.  In  conjunction  with  some  of 
the  other  test  features,  this  allows  any  instruction  and  operands  to  be  loaded,  exe¬ 
cuted,  and  verified. 

6.  External  access  to  clock  and  reset  signals  on  the  distribution  network.  This  provides 
easy  verification  of  basic  functionality  during  initial  testing. 

Together, these additions increase the chip area by less than 1% while providing greatly
improved access to internal points in the design.

4.10 Hardware Support for Denormals

A  denormalized  number  occurs  when  a  result  exponent  is  too  small  to  be  represent¬ 
ed  by  a  given  floating-point  format.  The  IEEE  754  specification  requires  that  the  returned 
denormal  be  the  infinitely  precise  result  multiplied  by  a  large  constant  and  then  appropri¬ 
ately  rounded.  These  steps  can  be  accomplished  without  impacting  cycle  time  or  overall 
chip  area  by  using  a  state  machine  to  perform  the  necessary  mantissa  shift  and  exponent 
adjustment.  However,  the  additional  design  and  verification  complexity  was  not  seen  as 
consistent  with  the  goals  of  an  academic  project.  In  addition,  denormals  occur  rarely  and 
can  be  handled  in  software.  In  fact,  the  most  common  occasion  for  denormals  arises  during 
iteration  convergence  when  a  result  value  is  less  than  some  tolerance  threshold;  this  is  often 
produced  by  a  subtraction  and  a  subsequent  comparison  to  zero.  The  fact  that  a  denormal 
might result is not particularly significant since for all purposes the difference is really comparable
to zero. The MIPS R4000 ISA addresses this case by implementing a flush-to-zero
mode in which denormalized results are set to zero. The Aurora III FPU follows this approach
and will trap to a software handler if the mode is not set.
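
The behavior just described can be modeled in a few lines of C. This is a sketch under stated assumptions: it operates on host double-precision values, the function name deliver_result and its parameters are hypothetical, and a flushed result is returned as positive zero, a detail the text does not specify.

```c
#include <assert.h>
#include <float.h>

/* Models the value delivered for a computed result.
 * flush_mode stands in for the flush-to-zero control bit; *trap is set
 * when the result would instead be handed to a software handler. */
static double deliver_result(double result, int flush_mode, int *trap) {
    double mag = (result < 0.0) ? -result : result;

    *trap = 0;
    /* A denormal is nonzero but smaller in magnitude than the smallest
     * normalized value; zero, normal numbers, infinity and NaN pass through. */
    if (mag == 0.0 || !(mag < DBL_MIN))
        return result;
    if (flush_mode)
        return 0.0;        /* assumption: flushed results become +0 */
    *trap = 1;             /* mode not set: leave it to software */
    return result;
}
```

In the convergence example above, a tiny difference such as DBL_MIN / 2 would be delivered as zero with the mode set, so the subsequent comparison against zero succeeds.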


CHAPTER  5 


CAD  Support  for  High 
Performance  Designs 

5.1  General  Observations  on  CAD  for  VLSI 

During  the  design  of  a  chip,  there  often  arises  a  need  to  solve  a  specific  problem  by 
creating  a  custom  CAD  tool.  For  example,  a  utility  might  be  needed  to  determine  the  fanout 
for  each  gate  in  a  design  and  alert  the  designer  to  any  instances  which  exceed  a  certain 
threshold.  There  are  a  number  of  trade-offs  to  consider  when  implementing  these  tools,  in¬ 
cluding  what  programming  environment  to  use,  how  often  the  tool  will  be  used,  what  size 
of  a  design  the  tool  will  be  used  for,  and  how  long  it  will  take  to  create  and  verify  the  tool. 

In  many  cases,  the  tool  will  do  nothing  more  than  read  an  ASCII  file,  such  as  a 
netlist,  and  perform  some  transformation;  for  this  type  of  problem  a  script  language  such 
as  “awk”  or  “perl”  will  suffice.  For  example,  it  might  be  necessary  to  add  an  attribute  to 
critical  nets  in  order  to  ensure  that  the  corresponding  interconnect  wires  have  a  certain 
width,  and  hence  no  more  than  a  certain  value  of  resistance.  Scripting  languages,  which  of¬ 
fer  support  for  manipulating  text  files,  are  easy  to  use,  allowing  for  a  fast  design  cycle. 

There  is  a  trade-off  between  how  many  times  the  utility  will  be  run  and  how  fast  it 
needs  to  operate.  A  delay-calculator  will  be  run  many  times  throughout  the  design  cycle, 
whereas  a  tool  to  derive  the  power  dissipation  of  a  chip  may  only  be  used  several  times.  In 
the  latter  case,  a  scripting  language  might  be  acceptable,  even  if  it  results  in  a  runtime  on 
the  order  of  several  hours.  However,  for  applications  that  require  a  fast  runtime,  a  low-level 
programming environment, such as C or C++, will allow the programmer to better tune the
utility for performance.

Care must be taken in how an algorithm is implemented and there are a number of
issues to consider. The use of recursion often provides the most efficient implementation,
but can make verification somewhat more difficult. It is important to keep in mind the size
and complexity of the designs that the utility will address. This will often determine whether
to use a linked-list or a hash table. Hash tables tend to require more effort on the part of
the  programmer  but  are  often  invaluable  in  managing  the  data  structures  for  a  large  design. 
Linked-lists  are  easier  to  implement  but  should  be  used  only  in  cases  where  their  length  will 
allow  efficient  traversal.  Often  a  combination  of  hash  tables  and  linked-lists  offers  the  best 
solution.  Dynamic  allocation  of  memory  for  data  structures  is  another  important  CAD  is¬ 
sue.  Because  of  the  cost  associated  with  calling  allocation  routines,  it  is  important  to  ex¬ 
clude  the  use  of  these  routines  from  sections  of  code  which  are  executed  frequently.  Data 
structures  should  be  created  only  once  and  subsequently  accessed  with  pointers. 
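
A minimal C sketch of these recommendations follows: a hash table whose buckets are short linked-lists, with all records drawn from a pool that is allocated once rather than through repeated calls to an allocation routine. The record type net_t, the table sizes, and the hash function are illustrative assumptions, not the data structures of the actual utilities.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS  1024u   /* power of two; illustrative size */
#define POOL_SIZE 4096u   /* maximum number of records; illustrative size */

typedef struct net {
    const char *name;     /* not copied: caller keeps the string alive */
    int         fanout;
    struct net *next;     /* next record that hashed to the same bucket */
} net_t;

static net_t  pool[POOL_SIZE];     /* created once, no per-record malloc */
static size_t pool_used;
static net_t *buckets[NBUCKETS];

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash_name(const char *s) {
    unsigned h = 5381u;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h & (NBUCKETS - 1u);
}

/* Find a net by name; if absent and create is nonzero, add it.
 * Returns NULL when the net is absent (create == 0) or the pool is full. */
static net_t *net_lookup(const char *name, int create) {
    unsigned b = hash_name(name);
    net_t *n;

    for (n = buckets[b]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    if (!create || pool_used == POOL_SIZE)
        return NULL;
    n = &pool[pool_used++];
    n->name = name;
    n->fanout = 0;
    n->next = buckets[b];          /* push onto the bucket's short list */
    buckets[b] = n;
    return n;
}
```

Each bucket's list stays short as long as the table is sized for the design, which is the condition under which linked-list traversal remains efficient.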

The  nature  of  a  CAD  algorithm  itself  may  limit  performance.  The  levelizer  de¬ 
scribed  below  has  several  modes  of  operation,  one  of  which  can  lead  to  runtimes  of  several 
days for designs having large amounts of connectivity. The issue here is not that the algorithm
has been implemented poorly, but instead, that in order to achieve a reasonable runtime,
a feature of the utility may need to be disabled. In many cases, the most important issue
concerns  how  long  it  takes  to  develop  the  tool.  The  tool  will  address  a  specific  issue  and  in 
so  doing  will  facilitate  a  stage  of  the  design  process;  often  it  is  difficult  to  proceed  with  a 
design  until  the  tool  has  been  completed.  Careful  work  in  the  early  implementation  stages 
can  minimize  the  iterations  required  to  improve  the  performance  of  the  tool. 

5.2  Delay  Calculation 

Delay calculation for the Aurora III methodology uses 2D interpolation tables based
on  work  by  [Kayssi93a].  The  tables  are  generated  by  running  HSPICE  for  each  primitive 
cell  with  various  combinations  of  interconnect  capacitance  and  fanout.  Since  the  building 
blocks  in  GaAs  DCFL  are  fairly  limited,  primitive  cells  consist  only  of  an  inverter,  NOR 
gates  of  fanin  between  2  and  4,  and  super-buffer  cells  which  have  a  fanin  between  1  and  8. 
Both the delay and the rise/fall time (slew) of a gate are calculated. For the
plain DCFL gates the axes of the lookup table are:


[Equation not recoverable from the source; legible terms: C_total, a quantity subscripted "driver", and Fanout]


C_total is the sum of the interconnect capacitance and an empirically derived value for
gate capacitance. This latter relation involves the width of the driver, the width of all driven
gates, and whether the transition is rising or falling; this rise/fall information is necessary
because  current  flows  into  the  gate  of  a  MESFET  transistor,  and  gate  capacitance  varies 
with  gate  current.  For  super-buffers,  the  axes  are  simply  the  total  capacitance  and  the  input 
slew  rate. 
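
As an illustration of a two-axis table lookup, the sketch below performs bilinear interpolation between the four surrounding table points. Bilinear interpolation is an assumption here, since the text does not state which interpolation formula is applied; the table dimensions and the names table2d_t and interp2d are likewise hypothetical.

```c
#include <assert.h>

#define NX 3   /* breakpoints on the first axis (illustrative) */
#define NY 3   /* breakpoints on the second axis (illustrative) */

typedef struct {
    double x[NX];        /* first axis, ascending (e.g. capacitance) */
    double y[NY];        /* second axis, ascending (e.g. fanout or slew) */
    double v[NX][NY];    /* characterized value at (x[i], y[j]) */
} table2d_t;

/* Index of the lower breakpoint of the cell containing q; queries outside
 * the table use the first or last cell and are therefore extrapolated. */
static int find_cell(const double *axis, int n, double q) {
    int i = 0;
    assert(n >= 2);
    while (i < n - 2 && q > axis[i + 1])
        i++;
    return i;
}

/* Bilinear interpolation of the table at (qx, qy). */
static double interp2d(const table2d_t *t, double qx, double qy) {
    int i = find_cell(t->x, NX, qx);
    int j = find_cell(t->y, NY, qy);
    double fx = (qx - t->x[i]) / (t->x[i + 1] - t->x[i]);
    double fy = (qy - t->y[j]) / (t->y[j + 1] - t->y[j]);
    double lo = t->v[i][j]     + fx * (t->v[i + 1][j]     - t->v[i][j]);
    double hi = t->v[i][j + 1] + fx * (t->v[i + 1][j + 1] - t->v[i][j + 1]);
    return lo + fy * (hi - lo);
}
```

One table of this kind would be needed for delay and another for output slew, for each primitive cell and transition direction.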

Traversal  of  a  design  uses  a  recursive  routine  provided  by  Cascade  Design  Automa¬ 
tion.  The  algorithm  begins  at  all  primary  outputs  and  works  backward  to  each  primary  input 
using  the  following  approach: 

for (i = 0; i < num_outputs; i++) calc_delays(inst[i]);

calc_delays(inst) {
    for (j = 0; j < num_inputs(inst); j++) {
        /* a slew of -1 marks an input whose driver has not been evaluated */
        if (inst.input[j].slew == -1 && inst.input[j].net != primary_input)
            calc_delays(inst.input[j].driver);
        delay = interpolate(inst.input[j]);
        /* keep the worst-case input-to-output delay for this gate */
        if (delay > inst.delay) inst.delay = delay;
    }
}

When  the  data  structure  for  the  circuit  is  created,  the  various  values  for  each  in¬ 
stance  are  initialized.  For  example,  the  worst  case  delay  through  a  gate  will  be  the  input- 
output  pair  with  the  maximum  delay;  initially  the  delay  for  the  gate  is  set  to  zero.  Similarly, 
the slew for each gate is initialized to minus-one. Referring to Figure 5.1, this simple example
would be traversed as follows:

Figure 5.1 Recursive Network Traversal

1.  visit  X4.I0 

2.  visit  X3.I0 

3.  visit  X2.I0 

4. visit X1.I0, calculate delay, slew

5. return to X2.I0, calculate delay, slew

6. visit X0.I0, calculate delay, slew

7. visit X0.I1, calculate delay, slew

8. return to X2.I1, calculate delay, slew

9. return to X3.I0, calculate delay, slew

10. return to X4.I0, calculate delay, slew

11. visit X4.I1, calculate delay, slew

Each  instance  may  be  visited  more  than  once,  but  by  utilizing  the  point  at  which  a 
delay  is  actually  calculated,  each  input  on  a  gate  will  be  visited  only  once.  This  engine  for 
traversing  a  design  is  the  core  for  several  of  the  utilities  which  follow. 
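
The traversal can be written as compilable C along the following lines. This is a sketch and not the Cascade routine: the data structures are invented for illustration, interpolate() is a made-up linear model standing in for the table lookup, and the output slew model is likewise a placeholder. Each input is marked by replacing its slew of -1, so every input is evaluated exactly once even when an instance is reached through several fanout paths.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IN 4

typedef struct inst inst_t;

typedef struct {
    inst_t *driver;    /* gate driving this input; NULL for a primary input */
    double  slew;      /* -1 until this input has been evaluated */
} pin_t;

struct inst {
    int    num_inputs;
    pin_t  input[MAX_IN];
    double delay;      /* worst input-to-output delay found so far */
    double out_slew;   /* slew presented to the gates this instance drives */
};

static int evaluations;    /* counts delay calculations, for illustration */

/* Placeholder for the table lookup: a made-up linear model. */
static double interpolate(const pin_t *p) {
    evaluations++;
    return 1.0 + 0.5 * p->slew;
}

static void inst_init(inst_t *g, int num_inputs) {
    assert(num_inputs <= MAX_IN);
    g->num_inputs = num_inputs;
    g->delay = 0.0;
    g->out_slew = 0.0;
    for (int j = 0; j < num_inputs; j++) {
        g->input[j].driver = NULL;
        g->input[j].slew = -1.0;
    }
}

static void calc_delays(inst_t *g) {
    for (int j = 0; j < g->num_inputs; j++) {
        pin_t *p = &g->input[j];
        if (p->slew >= 0.0)
            continue;                     /* input already evaluated */
        if (p->driver != NULL) {
            calc_delays(p->driver);       /* resolve the driver first */
            p->slew = p->driver->out_slew;
        } else {
            p->slew = 0.0;                /* assumed ideal primary input */
        }
        double d = interpolate(p);
        if (d > g->delay)
            g->delay = d;
    }
    g->out_slew = 0.2 * g->delay;         /* made-up output slew model */
}
```

The sketch assumes the network between latches is acyclic; a combinational loop would recurse without bound.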

In  order  to  verify  the  accuracy  of  the  delay  calculator,  several  utilities  were  created 
to  automatically  generate  a  ready-to-run  sensitized  HSPICE  netlist  from  a  path  selected 
within  the  Cascade  timing  analyzer  (Tactic).  First,  a  program  based  on  the  traversal  engine 
of the delay calculator is run to generate an ASCII database for a design. This database contains
information about each instance, such as the interconnect capacitance at the output,
the width of the transistors that comprise the gates, and the net names for all inputs and outputs
of each instance. A second program reads this database and creates an appropriate data
structure  for  the  design.  At  this  point,  paths  that  appear  in  Tactic  can  be  processed  extreme¬ 
ly  rapidly  and  an  HSPICE  netlist  for  each  path  is  created.  A  script  is  then  used  to  run 
HSPICE  and  compare  the  simulated  results  to  those  produced  by  the  delay  calculator.  For 
the  Aurora  II  design,  several  hundred  paths  of  various  lengths  (from  a  few  gates  to  greater 
than 30 gates) were evaluated. In all cases the error was found to be less than 10% and in
the  majority  of  cases  (>70%)  the  error  was  less  than  7%.  The  difference  came  primarily 
from  two  sources.  First,  there  is  an  inherent  error  in  both  the  mathematical  fit  involved  in 
generating  the  interpolation  tables  and  the  interpolations.  Second,  the  approach  to  delay 
calculation  just  described  does  not  consider  the  effect  of  more  than  one  input  being  high  for 
a  multiple  input  NOR  gate.  This  situation  results  in  an  output  node  being  discharged  slight¬ 
ly  faster  than  for  the  case  where  only  a  single  input  is  driven  by  a  logic  one.  Neither  of  these 
error  sources  is  very  significant  since  the  overall  accuracy  is  within  10%. 

Delays  derived  by  these  routines  are  used  in  two  ways.  First,  a  version  of  the  delay- 
calculator  is  directly  linked  into  the  static  timing  analysis  environment.  Information  about 
large  delays,  capacitances,  and  fanout  can  be  dynamically  reported  to  the  user.  Dracula- 
modelled  capacitances  can  be  incorporated  into  the  calculation  routines,  as  will  be  dis¬ 
cussed  below.  Second,  the  delay-calculator  can  run  in  a  stand-alone  mode  and  the  results 
can  either  be  back-annotated  to  any  digital  simulation  environment  (Verilog,  VHDL,  Men¬ 
tor  Graphics)  or  used  to  drive  a  buffer  sizing/selection  utility. 

5.3  Capacitance  Extraction 

The  extraction  of  parasitics  is  an  important  part  of  any  methodology  for  performing 
timing  analysis.  Parasitics  may  need  to  be  extracted  on  both  the  local  cell  level  and  the 
higher  global  level.  For  the  former,  source  code  for  a  local  cell  extractor  was  obtained  from 
Cascade  Design  Automation  and  was  modified  to  support  the  additional  interconnect  layers 
available in the Vitesse GaAs process. The results from the local extractor were verified
using Dracula and the error was less than a few percent. Global interconnect extraction,
however,  proved  to  be  more  difficult.  In  order  to  support  a  fast  iteration  for  timing  analysis, 
the  present  Cascade  tools  do  not  precisely  extract  global  interconnect.  Instead,  a  single  em¬ 
pirical  capacitance  value  (in  femto-farads  per  micron)  is  derived  for  each  interconnect  layer 
and  the  overall  capacitance  for  a  net  is  found  by  simply  multiplying  this  layer  constant  by 
the  net  length  that  is  routed  in  that  layer.  This  approach  can,  in  the  worst  case,  lead  to  sig¬ 
nificant  errors  in  capacitance.  Consider  a  case  where  two  wires  run  from  adjacent  sources 
to  adjacent  destinations.  One  wire  follows  an  overcell  track  across  a  datapath  and  encoun¬ 
ters  a  large  amount  of  interlayer  capacitance,  whereas  the  other  wire  is  routed  in  a  channel 
and  sees  only  the  capacitance  to  the  substrate  and  ground  plane.  In  most  process  technolo¬ 
gies,  interlayer  capacitance  for  long  wires,  especially  for  adjacent  layers,  is  more  signifi¬ 
cant  than  the  capacitance  down  to  the  substrate  or  up  to  the  ground  plane.  The  calculated 
value  will  be  the  same  for  both  wires,  when  in  fact  one  wire  may  have  a  capacitance  up  to 
100%  larger  than  the  other.  The  results  of  delay  calculation  and  subsequent  decisions  about 
critical  paths,  buffering,  and  wire  sizing  can  all  be  affected  by  inaccurate  capacitances. 
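
The per-layer estimate and its comparison against an extracted reference value amount to the arithmetic sketched below. The layer count and function names are assumptions for illustration. Because the estimate depends only on routed length per layer, two wires of equal length on the same layers receive the same value no matter what they cross.

```c
#include <assert.h>

enum { NLAYERS = 3 };   /* assumed number of routing layers */

/* Estimated net capacitance in fF: for each layer, the empirical constant
 * (fF per micron) times the length of the net routed in that layer. */
static double net_cap_fF(const double cap_per_um[NLAYERS],
                         const double len_um[NLAYERS]) {
    double total = 0.0;
    for (int l = 0; l < NLAYERS; l++) {
        assert(len_um[l] >= 0.0);
        total += cap_per_um[l] * len_um[l];
    }
    return total;
}

/* Percentage difference between an estimate and a reference value, such as
 * one obtained from a full extraction. */
static double pct_diff(double estimate, double reference) {
    assert(reference > 0.0);
    double d = estimate - reference;
    if (d < 0.0)
        d = -d;
    return 100.0 * d / reference;
}
```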

Global  extraction  is  further  complicated  by  how  the  empirical  value  for  each  layer 
is  derived.  Some  heuristic  percentage  is  applied  to  the  plate  and  fringing  capacitance  of 
each  layer,  and  the  sum  of  these  derated  values  across  all  layers  defines  the  single  empirical 
value for that layer. However, these heuristics are built into the router and
are not visible to the user. As a result, the layer capacitance values in the process file must
be adjusted until the calculated capacitances are in the best agreement with the Dracula-generated
ones. This was done by using two representative test cases, one composed of standard
cells  and  the  other  of  datapath  cells.  A  script  was  written  which  reads  the  Cascade  database 
and  adjusts  the  various  layer  capacitance  parameters  until  the  best  match  with  Dracula  is 
found.  Table  5.1  contrasts  the  original  process  file  values  (obtained  from  Cascade)  with 
those  derived  from  this  approach.  Using  the  improved  layer  values  results  in  an  average  er¬ 
ror  of  about  30%  and  a  maximum  error  of  about  90%.  These  errors  are  due  to  the  inherent 
inaccuracy in the approach for global extraction, but are acceptable for first-pass timing
analysis.

Table 5.1 Global Capacitance Tuning

Test Case                    Avg Difference      Largest Difference   Largest
                             with Dracula (%)    with Dracula (%)     Capacitance (fF)
Standard Cells (original)    303.9               688.7                0.185
Standard Cells (improved)    38.5                93.8                 0.302
Datapath Cells (original)    350.3               701.5                0.163
Datapath Cells (improved)    27.7                85.7                 0.275

To further improve global extraction, several utilities were written which allow
Dracula-generated capacitance values to be incorporated into the Cascade timing methodology.
The main focus of these utilities is syntactically matching the capacitances in a Dracula-extracted
SPICE netlist with the corresponding nets in the Cascade database. Hash
tables and linked-lists are used extensively to provide an efficient runtime. Used during the
latter phase of the Aurora II design, these utilities were in part responsible for the close
agreement (within 10%) between predicted clock frequency and that measured in testing.
The  use  of  Dracula  capacitances  should  be  reserved  until  the  very  last  stage  of  timing  anal¬ 
ysis,  since  generating  these  capacitances  requires  a  successful  layout-versus-schematic 
(LVS)  check.  Typically,  cell  development  and  validation  proceed  in  parallel  with  the  other 
aspects  of  the  design  and  are  not  completed  until  late  in  the  design  process. 

5.4  Clock  Phase  Hazards 

A  two-phase  level-sensitive  non-overlapping  clocking  scheme  was  originally  cho¬ 
sen  for  conservative  reasons,  since  if  circuits  are  properly  designed  this  approach  ensures 
functionality  at  some  clock  frequency.  Some  design  errors  can  be  worked  around  by  adjust¬ 
ing  the  frequency  and  placement  of  clock  edges  on  a  tester  in  order  to  verify  the  function¬ 
ality  of  a  chip.  On  the  other  hand,  a  flip-flop  based  design  requires  much  better  analysis.  A 
decade ago it was reasonable to assume that gate delay was much larger than interconnect
delay and the control of clock skew did not greatly impact the functionality of a design.
Current process technologies support gate switching speeds of 100 to 200 ps and the component
of overall path delay due to interconnect has increased greatly; for GaAs, average
loaded gate delays of 100 to 150 ps mean that RC delay comprises 40% to 60% of the overall
path delay. Consequently, much better analysis of clock skew is necessary to ensure
proper functionality (clearly, skew also has an impact on performance). To illustrate this,
consider Figure 5.2 which shows that if the clock is delayed to the second flip-flop due to
a longer interconnect path, this flip-flop may incorrectly latch the data that has propagated
through the first flip-flop.

The  use  of  two  phase  clocking  introduces  a  potential  design  hazard  which  can  limit 
the  frequency  at  which  a  chip  will  operate.  Figure  5.3  shows  a  simple  representative  exam¬ 
ple,  where  signals  latched  in  a  previous  phase  are  to  be  used  to  generate  signals  that  are 
latched  in  the  subsequent  phase.  This  logic  inadvertently  uses  inputs  from  both  clock  phas¬ 
es,  resulting  in  a  reduced  active  time  for  the  Phi2  phase;  it  is  constrained  to  be  equal  to  the 
propagation  delay  through  this  logic  block.  An  easily-implemented  two-phase  non-over¬ 
lapping  clock  generator  has  an  active  period  for  each  phase  that  is  equal  to  1/4  of  the  clock 
period;  the  2  non-overlap  regions  account  for  the  other  1/2  of  the  period.  For  a  4  ns  clock 
(250  MHz),  the  active  period  would  be  1  ns.  However,  the  delay  along  a  critical  path  may 
be  almost  twice  this  amount;  the  hazard  means  that  the  active  time  for  the  Phi2  clock  will 
need  to  be  long  enough  to  accommodate  this  delay.  Several  of  these  errors  were  discovered 
in the Aurora II design when it was fabricated. More rigorous nomenclature might facilitate
recognizing these cases, but many times the logic is complex and is derived from numerous
intermediate signals.

Figure 5.2 Clock Skew for a Flip-Flop Based Design

Figure 5.3 Clock Hazard for 2 Phase Design

To ensure completeness, an automated check must be performed; to do the verification,
a utility based on the delay-calculator traversal engine was written. Using
the recursive algorithm described in Section 5.2, the utility visits every bit of every
latch  in  the  design  only  once.  Whenever  a  latch  is  encountered,  a  second  recursive  routine 
is  used  to  travel  forward  through  the  design  in  order  to  reach  all  gates  driven  by  the  latch. 
If  a  path  terminates  at  another  latch,  information  about  this  latch  is  added  to  a  hash  table. 
After  traversing  the  entire  design,  the  tool  examines  the  hash  table  to  identify  any  occur¬ 
rences  of  two  successive  latches  being  driven  by  the  same  clock  phase.  For  reasons  which 
are  discussed  in  Section  5.7,  the  runtime  for  this  utility  can  be  quite  long.  For  example,  the 
tool  verified  the  FPU  design  in  12  hours,  identifying  20  hazards.  This  utility  is  run  only  a 
few  times  near  the  end  of  the  design  cycle,  and  so  a  longer  runtime  is  acceptable. 
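
The final check reduces to comparing the clock phases at the two ends of every latch-to-latch path. The C sketch below assumes the forward traversal has already produced a list of such paths; the array-based representation and the names latch_path_t and count_phase_hazards are illustrative, whereas the actual utility stores this information in a hash table.

```c
#include <assert.h>
#include <stddef.h>

/* One combinational path: latch src drives logic that ends at latch dst.
 * Latches are identified by their index into the phase array. */
typedef struct {
    int src;
    int dst;
} latch_path_t;

/* Count paths whose source and destination latches use the same clock
 * phase. phase[i] is 1 or 2 for latch i. */
static size_t count_phase_hazards(const int *phase,
                                  const latch_path_t *paths,
                                  size_t n_paths) {
    size_t hazards = 0;
    for (size_t k = 0; k < n_paths; k++) {
        assert(paths[k].src >= 0 && paths[k].dst >= 0);
        if (phase[paths[k].src] == phase[paths[k].dst])
            hazards++;     /* two successive latches on one phase */
    }
    return hazards;
}
```

The comparison itself is linear in the number of paths; in this sketch, any long runtime would come from the traversal that enumerates the paths, not from the check.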

5.5  Clock  Distribution  Analysis 

Level sensitive latches and phase borrowing (as discussed in Section 5.7) are somewhat
more tolerant of clock skew; however, accurate analysis of the clock distribution network
is still critical for a high clock rate processor. A starting point for the
distribution network was based on insight gained during completion of the Aurora II design.
In the Aurora II CPU, both clock phases enter the chip from the same side and, although
close in proximity, are separated by several ground pads. It became evident while
testing the initial design that cross-talk can be significant if the clocks are assigned to adjacent
pad locations. Both phases are routed to a central location on the chip where on-chip
clocks are generated and driven into a five-level distribution network. The non-overlapping
clock  phases  distributed  internally  are  generated  from  the  external  clocks,  which  are  90  de¬ 
grees  out  of  phase,  using  the  following  relationships: 

$\mathrm{Phi1} = \overline{\mathrm{clk1}} + \mathrm{clk2}$

$\mathrm{Phi2} = \mathrm{clk1} + \overline{\mathrm{clk2}}$

In addition, the sense amplifier-based register file requires a third phase, which is
formed from an XOR of the two external clocks. The overall goal is to constrain
latch-to-latch skew to no more than 600 to 700 ps; an empirical rule limits fanout along each
branch of the network to no more than 9. Each standard cell is assumed to provide a load
of 2 minimum-sized enhancement transistors (15 microns of width), while datapath
instances terminate in a column driver (50 microns of width). Very large buffers are used
along the first levels of the distribution network. The last stage of the network ends at a
local driver cell, which for datapath cells is physically located in the first row of a column.
In order to keep polarity uniform throughout the chip, we added drivers to each standard-cell
latch as well. A final post-processing step in the design methodology reduces the number
of drivers by merging multiple latches so that they share a common driver.

Several utilities were written to help analyze the clock distribution network. The
primary one, based on the delay-calculator traversal routine, generates a ready-to-run
HSPICE netlist for each clock phase of the network. It first builds a data structure to
represent the circuit, and then recursively moves outward from the top-level clock inputs.
Each gate encountered along the way is added to a hash table; this proceeds until all latches
that terminate the distribution network have been visited. A final routine reconstructs just
the distribution network and creates a sensitized HSPICE netlist which contains
capacitances, specific gate types, and fanout. As with other delay calculator-based utilities,
Dracula-generated capacitances can be incorporated into the output netlist. After HSPICE
is run for both clock phases, several analysis scripts provide different ways of analyzing
the data. First, the transit time delays for both phases are sorted and plotted, as shown in
Figure 5.4 for the FPU. A rough sense of skew in the design can be obtained by comparing
the range of values for these 2 graphs; a possible maximum value would be 0.72 ns (1.1 ns
of the Phi1 graph minus 0.38 ns of the Phi2 graph). This does not necessarily mean that 2
successive latches experience a skew of this magnitude. In order to obtain a more accurate
report of latch-to-latch skew, a utility similar to the clock phase checking program was
created. This program creates an ASCII database of all successive latch pairings; it is used
in conjunction with the output from HSPICE to obtain a list of those latch pairs for which
the skew exceeds a user-specified threshold. A final utility can be used to generate a
3-dimensional representation of clock transit time versus location on the chip. Figure 5.5 is
such a plot for the Aurora II CPU.

Figure 5.4 Sorted Clock Transit Times (No Resistances)
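
The latch-pair skew report can be illustrated with a short Python sketch. It assumes that skew is the absolute difference between the clock transit times at the driving and receiving latch, and that the pairing database and the HSPICE results have already been parsed into ordinary Python structures; the function name, latch names, and numbers are hypothetical.

```python
def skew_violations(latch_pairs, transit_ns, threshold_ns):
    """List latch pairs whose clock skew exceeds threshold_ns, largest first.

    latch_pairs: iterable of (driving_latch, receiving_latch) names
    transit_ns:  dict mapping a latch name to its clock transit time in ns
    """
    report = []
    for src, dst in latch_pairs:
        skew = abs(transit_ns[dst] - transit_ns[src])
        if skew > threshold_ns:
            report.append((src, dst, round(skew, 3)))
    return sorted(report, key=lambda row: -row[2])


# Hypothetical transit times; 0.38 ns and 1.10 ns are the extremes quoted in the text.
transit_ns = {"A": 0.38, "B": 1.10, "C": 0.70}
latch_pairs = [("A", "B"), ("B", "C"), ("A", "C")]
print(skew_violations(latch_pairs, transit_ns, threshold_ns=0.6))
```

Only the A to B pairing exceeds the 0.6 ns threshold here, which shows why the pairwise report is more informative than the overall range of the sorted plots.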

Interconnect resistance along the distribution network can significantly impact
clock skew. Figure 5.6 shows the sorted transit times for the Phi1 clock phase once
resistances are added; it is evident that these resistances must be reduced. The problem is
addressed both by increasing the size of the wires that are routed for the distribution
network and by constraining this routing to use only the upper two interconnect layers,
which have a lower resistivity. Both of these actions are initiated by adding attributes in
Floorplanner to the nets in question. A script is used to identify problem nets, calculate the
appropriate wire sizes, and generate a second script of attribute commands which can be
invoked from within Floorplanner. When resistances are reduced to less than 50 ohms per
net, the clock transit times are found to be reasonable, as shown in Figure 5.7. The analysis
of resistance is discussed further in Section 5.6.

Figure 5.7 Sorted Clock Transit Times (With Final Resistances)

5.6  Resistance  Extraction 

As a quick and easy tool, a Perl script was written to extract resistances from a
design. The runtime for the FPU is 2 hours, which is reasonable since this step is used
primarily toward the end of the design cycle. This script utilizes a Cascade database-viewing
tool (Proman) to find resistances that are local to all datapath partitions, and then proceeds
to extract resistances for top-level global nets. Since a net can comprise numerous
branches, the resistance for the net is derived from the longest branch. The resistances and
a summary of the percentage that each layer contributes to the average net-route are written
to a text file, which is then used by a wire sizing script. This sizing script makes use of the
following relationship:

$$
\mathrm{Resistance} =
\frac{\mathrm{m1ratio} \cdot \mathrm{m1shres} \cdot \mathrm{m1pathlength}}{\mathrm{Width}}
+ \frac{\mathrm{m2ratio} \cdot \mathrm{m2shres} \cdot \mathrm{m2pathlength}}{\mathrm{Width}}
+ \frac{\mathrm{m3ratio} \cdot \mathrm{m3shres} \cdot \mathrm{m3pathlength}}{\mathrm{Width}}
$$

The interconnect ratios (m1ratio, m2ratio, m3ratio) in this equation are obtained
from the resistance extraction script and represent the percentage of a typical net that is on
a given layer. Since it is important to minimize the resistance along the clock distribution
network, these nets are routed only on the upper two interconnect layers, which have less
resistance than the lowest routing layer (a factor of two lower for the top layer). Clock wires
are sized by assuming a worst-case situation where all routing is done on the second layer.
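
As a rough illustration of how a sizing script might apply this relationship, the Python sketch below assumes the net resistance is the sum over the three layers of ratio times sheet resistance times path length, divided by a common wire width, and solves for the width that meets a target resistance. The function names, sheet resistances, and path length are hypothetical values chosen only for the example; the 50 ohm target is the per-net figure quoted in Section 5.5.

```python
def resistance_width_product(layers):
    """Sum of ratio * sheet_resistance * path_length over the routing layers.

    layers: list of (ratio, sheet_res_ohm_per_square, path_length_um) tuples
    """
    return sum(ratio * shres * length for ratio, shres, length in layers)


def net_resistance(layers, width_um):
    """Net resistance in ohms for a wire of the given width."""
    return resistance_width_product(layers) / width_um


def width_for_target(layers, target_ohms):
    """Wire width in microns that brings the net down to target_ohms."""
    return resistance_width_product(layers) / target_ohms


# Hypothetical worst case for a clock net: the whole 5000 um route on metal 2.
PATH_UM = 5000.0
worst_case = [(0.0, 0.08, PATH_UM),   # metal 1 (unused for clocks)
              (1.0, 0.06, PATH_UM),   # metal 2
              (0.0, 0.04, PATH_UM)]   # metal 3
width = width_for_target(worst_case, 50.0)
print(width, net_resistance(worst_case, width))
```

With these made-up numbers the net needs a wire about 6 microns wide to stay at 50 ohms.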

Our current static timing analysis methodology does not consider the effect of
resistance, which results in a degree of inaccuracy when analyzing critical paths. A
macromodel approach for deriving RC delay has been developed but has not yet been
implemented [Kayssi93b]. The current script-based approach should be rewritten in C to
improve performance and more closely couple this step to delay calculation. In addition, the
user might specify a delay threshold below which RC delay is ignored, in order to improve
the runtime.


5.7  Determination  of  Gate  Path  Length 


Improving the performance of a design is an iterative process, involving the
following steps:

1. Identify all situations where the number of gates that lie along a path exceeds the
allowable target for one clock phase. For a 4 ns clock period (250 MHz) and an
average loaded gate delay of 100 ps, this target is set at 20 gates per clock phase.

2. Reduce the number of gates along a critical path, which can be done in several ways:
a) reimplement a synthesized logic block by hand, b) factor out any late-arriving
signals, and c) retime logic by shifting it into the previous or next clock phase. The
application of these approaches will often uncover additional critical paths, and
retiming may make paths critical which were previously acceptable. Consequently,
it is necessary to iterate over these first two steps.

3. Identify instances where capacitance, resistance, and/or fanout result in individual
gate delays that exceed the goal of 100 ps per gate and lead to path delays that are
greater than the target delay. Approaches to reducing path delay include: a) replace
DCFL gate instances with a buffered counterpart, b) change the size of the driving
gate, c) reduce fanout by using multiple copies of the driving gate, d) reduce
interconnect capacitance through better placement, routing, and cell design, and
e) reduce interconnect resistance by sizing wires and by constraining the layers used
for routing.

The Cascade timing analyzer (Tactic) can be used to perform the analysis, but there
are several reasons why this is not the best solution for the first 2 steps. First, Tactic presents
paths in a graphical manner, which is beneficial for visually identifying components that
comprise the overall path delay, but tends to be inefficient for summarizing just the gate
depth along the path. A text-based report is more efficient at this point since it gives the user
flexibility in sorting and searching through the list of paths. Second, support for a visual
interface adds overhead to the runtime. For a complex design, it may be necessary to iterate
many times over the first 2 steps, so the total iteration cycle needs to be no more than two
hours. In addition, at least one step performed by the current version of Tactic appears to
be extremely time-consuming, requiring several days for a large design. Finally, there are
several analysis features not supported by Tactic which can simplify the process of
reducing logic depth; these are discussed below.

In order to provide an alternative for identifying and resolving logic depth, a
levelizer based on the delay-calculator traversal engine was written. This utility was
originally designed to report all paths from inputs to outputs for synthesized logic blocks and
was later extended to identify all paths between latches. In supporting the latter, it became
evident that identifying all paths from the inputs to the outputs for designs with large
amounts of connectivity can easily become a computationally intensive task. The traversal
engine ensures that each bit of a latch is encountered only once, and the original algorithm
for the levelizer can be summarized by the following:

    /* invoked by the traversal engine for every instance it visits */
    if (inst == latch) {
        start_inst = inst;        /* latch at which this set of paths begins */
        find_paths(inst, 1);
    }

    find_paths(inst, level) {
        for (i = 0; i < num_load_instances(inst); i++) {
            /* a path ends at the next latch or at a primary output */
            if (inst.successor_inst[i] == latch || inst.output[i] == primary_output)
                add_inst_to_successor_list(start_inst, inst.successor_inst[i], level);
            else
                find_paths(inst.successor_inst[i], level + 1);
        }
    }

After the circuit has been traversed, a post-processing routine is invoked which
sorts through the hash table created by add_inst_to_successor_list() to identify paths whose
depth exceeds a level specified by the user. The result is a list of all possible paths between
inputs, outputs, and latches. The requirement of traversing all inputs to all outputs can
become quite costly for designs that exhibit a large degree of connectivity. Consider the
heavily interconnected example of Figure 5.8. Assuming each gate in a column drives
multiple gates in the subsequent column, the number of calls to find_paths() will be on the
order of:

$$
\mathrm{Calls} = \left( \frac{\mathrm{Gates}}{\mathrm{Column}} \right)^{\mathrm{NumColumns}}
$$

Figure 5.8 Run-time Increase For High Degree of Connectivity

The Wallace array of the multiply unit is a good example of a highly interconnected
circuit; the runtime for this design is approximately 12 hours. However, the runtime can be
dramatically reduced by constraining the algorithm to find only the worst-case path
between latches and inputs/outputs. Doing so simply requires keeping track of the worst level
seen at any gate; this version of find_paths() can be summarized by the following:

    find_paths(inst, level) {
        /* remember the deepest level at which this instance has been reached */
        if (level > inst.level) inst.level = level;

        for (i = 0; i < num_load_instances(inst); i++) {
            if (inst.successor_inst[i] == latch)
                add_inst_to_successor_list(start_inst, inst.successor_inst[i], level);
            /* continue only if this path is deeper than any seen before */
            else if ((level + 1) > inst.successor_inst[i].level)
                find_paths(inst.successor_inst[i], level + 1);
        }
    }

This approach prevents the explosive increase in calls to find_paths() by keeping
track of only the worst-case path to each instance. Any attempt to continue expanding
outward in a given direction will be inhibited if a previous traversal with a longer path length
has already been encountered. The corresponding runtime for a design such as the multiply
unit is now on the order of an hour, a reduction in runtime of an order of magnitude.
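
The effect of the pruning can be seen in a small experiment. The Python sketch below is an illustration rather than the levelizer itself: it builds a hypothetical fully connected, column-structured circuit between two latches (in the spirit of Figure 5.8), runs both the exhaustive and the worst-case-only versions of find_paths(), and counts the calls. The helper names `build_layered` and `levelize` are assumptions, and primary outputs are omitted for brevity.

```python
def build_layered(num_columns, gates_per_column):
    """Start latch -> fully connected columns of gates -> end latch."""
    columns = [[f"g{c}_{j}" for j in range(gates_per_column)]
               for c in range(num_columns)]
    succ = {"start": list(columns[0]), "end": []}
    for c, column in enumerate(columns):
        nxt = columns[c + 1] if c + 1 < num_columns else ["end"]
        for gate in column:
            succ[gate] = list(nxt)
    return succ, {"start", "end"}


def levelize(succ, latches, start, prune):
    """Return (worst level recorded per terminating latch, calls to find_paths)."""
    calls = 0
    level_of = {}   # worst level seen so far at each instance
    worst = {}      # terminating latch -> deepest level that reached it

    def find_paths(inst, level):
        nonlocal calls
        calls += 1
        level_of[inst] = max(level_of.get(inst, 0), level)
        for nxt in succ[inst]:
            if nxt in latches:
                worst[nxt] = max(worst.get(nxt, 0), level)
            elif not prune or level + 1 > level_of.get(nxt, 0):
                find_paths(nxt, level + 1)

    find_paths(start, 1)
    return worst, calls


succ, latches = build_layered(num_columns=4, gates_per_column=3)
exhaustive = levelize(succ, latches, "start", prune=False)
pruned = levelize(succ, latches, "start", prune=True)
print(exhaustive, pruned)
```

For 4 columns of 3 gates, the exhaustive version makes 121 calls and the pruned version 13, while both report the same worst-case level at the terminating latch. The exhaustive count grows geometrically with the number of columns, whereas the pruned count in this layered example grows only with the number of gates.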

Several additional features of the levelizer facilitate the process of reducing gate
path length. The use of a two-phase latch-based clocking scheme allows the chance to
borrow time from the clock phase which precedes a critical path. Since latches are transparent
during the entire active time of a clock phase, it is possible that the combinational logic that
drives the input to a latch may stabilize prior to the enabling edge of the clock. This stable
signal is then immediately available to drive logic in the subsequent phase. As a result, a
critical path is really comprised of the sum of the gate-depth for the logic that occurs in both
clock phases. Although 20 gates per phase has been chosen as a design target, in selected
cases this constraint can be relaxed if the worst-case path that drives the current critical path
has fewer than 20 gates. It is important to control clock skew along these path-pairs, since
a clock that arrives late to the first-phase latch will reduce the ability to borrow time from
this phase. The levelizer generates a 3D histogram (Figure 5.9, the multiply unit) for which
the x-axis is the level of the previous worst-case path, the y-axis is the level of the current
maximum path, and the z-axis is the number of instances which have this previous-current
pair. Figure 5.10 and Figure 5.11 show these plots for the overall FPU (excluding the
functional units) and the add unit; note that the count (z-axis) is limited to make the plots more
readable. These plots represent the results from numerous iterations over the first 2 steps
described above. An additional option will print the worst-case path for the phase which
succeeds the current one; this information is useful when moving logic across latches while
performing manual retiming. The levelizer also generates a list of the 50 instances that
appear most often along all paths, allowing the user to focus on the most troublesome logic
blocks.

Figure 5.10 FPU Critical Paths of Current and Previous Phase (excl. FU's)

5.8  Post-processing  Optimization  Utilities 

Figure 5.11 Add Unit Critical Paths of Current and Previous Phase

Several additional functions have been incorporated into a single post-processing
optimization utility, nicknamed the “gobbler” due to its ability to remove and transform
logic instances. First, automatic buffer selection can be performed by running the
stand-alone delay-calculator, which lists instances whose delay, output capacitance, or fanout
exceeds a user-specified threshold. This text database is then used by the optimization tool to
translate instances having large delays into buffered versions with the same functionality.
Figure 5.12 compares delay, capacitance, and fanout for the FPU before and after buffer
selection. Delays are improved significantly. Capacitance decreases slightly due to several
layout optimizations that were performed in the second iteration. The large fanout points
appear in part because a reset distribution tree had not yet been implemented; other large
fanout instances correspond to the larger delay points. This comprehensive approach trades
slightly larger area and power dissipation for computational efficiency. In other words, this
approach is simple to implement but runs the risk of replacing DCFL gates with larger
buffered versions for instances which have individually large delays but do not appreciably
extend the overall clock period.

A second function of the optimization utility involves automatically improving the
gate-depth of logic blocks by searching for certain common logic sequences. For example,
Figure 5.13 shows a sequence that occurs frequently within logic that has been synthesized
by the Cascade utility Finesse. In this case, 2 levels of logic can be removed by merging the
inputs of gate 1 and gate 3 by increasing the fanin for gate 3. Some other pattern matching
optimizations which improve either logic depth or area due to fanin are:

1. 2 successive inverters can be removed altogether, assuming they are not needed for
buffering reasons.

2. A constant input can be factored out of a gate, thereby reducing the fanin of the gate
and/or propagating the output. For instance, a logic one on a NOR gate can be passed
on to the successive gate as a logic zero. In turn, if the successive gate is a NOR
function, this logic zero can be factored out, allowing a reduction of the fanin of the NOR
gate. These sequences can occur as a result of poor logic synthesis, or in redundancy
that can occur at the interfaces between logic blocks.

3. A gate can be replaced by its dual, as when a NOR gate whose inputs are
complemented is replaced with an AND gate.

4. Gates which are driven by the same inputs can be merged into a single gate, assuming
fanin constraints are not violated.

Figure 5.12 FPU Delay, Capacitance, Fanout Before and After Buffer Selection

Figure 5.13 Pattern Matching Logic Optimization

Care must be taken to not inadvertently remove redundant, yet necessary, logic. The
various stages of the clock distribution network form buffer-pair sequences which do not
serve a functional purpose but are necessary for timing reasons. To avoid problems, this
utility recognizes a “no_touch” property that can be attached to special instances that are to
be excluded from optimization. The generation of a logic one in GaAs is another such
example. A voltage of Vdd on the gate will forward bias the gate channel diode, will probably
destroy the inverting transistor, and may result in an incorrect value for the output of the
gate. A buffer whose input is tied to ground is typically used to generate a logic one. It is
important that this buffer not be removed from the design during the propagation of
constants; the “no_touch” property is used to ensure this.
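
One of these pattern-matching passes, removal of successive inverters while honoring the “no_touch” property, can be sketched as follows. This is an assumption-laden illustration and not the gobbler's code: the netlist is a hypothetical dictionary of gates, names that do not appear as keys are treated as primary inputs, and a pair is removed only when the first inverter drives nothing but the second.

```python
def remove_inverter_pairs(gates):
    """Return a copy of the netlist with removable inverter pairs bypassed.

    gates: dict mapping a gate name to {"type": str, "inputs": [names],
           "no_touch": bool (optional)}.
    """
    gates = {name: dict(g, inputs=list(g["inputs"])) for name, g in gates.items()}
    changed = True
    while changed:
        changed = False
        fanout = {}
        for name, g in gates.items():
            for src in g["inputs"]:
                fanout.setdefault(src, []).append(name)
        for second, g2 in gates.items():
            if g2["type"] != "INV" or g2.get("no_touch"):
                continue
            first = g2["inputs"][0]
            g1 = gates.get(first)
            if g1 is None or g1["type"] != "INV" or g1.get("no_touch"):
                continue
            if fanout.get(first) != [second]:
                continue                  # first inverter is still needed elsewhere
            source = g1["inputs"][0]
            for consumer in fanout.get(second, []):
                gates[consumer]["inputs"] = [
                    source if s == second else s for s in gates[consumer]["inputs"]]
            del gates[first], gates[second]
            changed = True
            break                         # netlist changed: rebuild fanout and rescan
    return gates


# Hypothetical netlist: i1/i2 are redundant; c1/c2 are protected clock buffers.
gates = {
    "i1": {"type": "INV", "inputs": ["a"]},
    "i2": {"type": "INV", "inputs": ["i1"]},
    "n1": {"type": "NOR", "inputs": ["i2", "b"]},
    "c1": {"type": "INV", "inputs": ["clk"], "no_touch": True},
    "c2": {"type": "INV", "inputs": ["c1"], "no_touch": True},
    "lat": {"type": "LATCH", "inputs": ["n1", "c2"]},
}
optimized = remove_inverter_pairs(gates)
print(sorted(optimized), optimized["n1"]["inputs"])
```

The logic inverters are bypassed so that the NOR gate is driven directly by `a`, while the clock buffer pair marked “no_touch” survives untouched.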

A final function of the optimization program involves merging standard cell latch
drivers. As was discussed earlier, a simpler design style with regard to clock polarity results
if standard cells use clock buffers to match the polarity at the column drivers for datapath
latches. This driver buffers the clock locally and decodes any select signals for merged
logic-latch cells. Whereas a column driver is amortized across the many bits of a datapath
column, it is costly to have a single driver for every standard cell latch. This tool therefore
merges the drivers for multiple latches into just one instance, constrained by a
user-specified maximum fanout. For the FPU, this optimization removed approximately 900 latch
drivers.
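
A minimal sketch of the driver-merging step is shown below. It assumes that two latches may share a driver only when they use the same clock phase and the same select signal, and it ignores physical placement; both are simplifying assumptions, and the function and latch names are hypothetical.

```python
from collections import defaultdict


def merge_latch_drivers(latches, max_fanout):
    """Group latches onto shared drivers, at most max_fanout latches per driver.

    latches: list of (latch_name, (clock_phase, select_signal)) tuples, where
             select_signal is "" for latches without a select.
    Returns a list of ((clock_phase, select_signal), [latch names]) drivers.
    """
    groups = defaultdict(list)
    for name, key in latches:
        groups[key].append(name)
    drivers = []
    for key in sorted(groups):
        members = groups[key]
        for i in range(0, len(members), max_fanout):
            drivers.append((key, members[i:i + max_fanout]))
    return drivers


# Hypothetical design fragment: seven latches, each originally with its own driver.
latches = [(f"l{i}", ("phi1", "")) for i in range(5)]
latches += [(f"m{i}", ("phi2", "sel_a")) for i in range(2)]
drivers = merge_latch_drivers(latches, max_fanout=4)
print(len(latches) - len(drivers), "drivers removed")
```

Here seven per-latch drivers collapse to three shared drivers, so four are removed.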


5.9  Miscellaneous  Utilities 

Several small tools were written to address issues such as beta ratio checking, power
rail sizing, and determination of power dissipation. The beta-ratio for a DCFL gate sets both
the speed and noise margins, and a mistake in a cell can prevent proper functionality. To
verify beta-ratios, a version of the delay-calculator traversal engine was used to ensure that
all versions of all gate types in a design have been checked at least once.

Power rails for datapath cells need to be sized such that the voltage drop to any cell
in the column is acceptable. A Perl script was written which queries the design database
in order to identify, for each cell, the sum of the widths of all enhancement and depletion
transistors that are connected to Vdd. Several empirical relationships for current as a
function of transistor width, generated by iterative HSPICE runs, are used to estimate the
current for each cell. This information is then used in conjunction with the following
relationship in order to derive the width of the Vdd rail for each cell:


$$
\mathrm{width}_{Vdd} =
\frac{\mathrm{pitch} \cdot R_{sh} \cdot N_{bits} \cdot I_{cell}}
     {IR_{max} \cdot DR \cdot N_{rails}}
$$

pitch = The maximum height of a datapath in microns; usually this is the maximum
bit-width (32 or 64) times the effective cell pitch (60 to 70 microns)

R_sh = The sheet resistance for metal3, the layer used for routing Vdd

N_bits = The maximum bit-width for all instances of this cell

I_cell = The cell current derived from empirical relationships

IR_max = The maximum allowable voltage drop along the Vdd rails

DR = A derating factor that takes into account both the fact that a column is
connected on the top and bottom to global Vdd and the fact that current decreases
with distance down the column; generally a value of four is used

N_rails = The number of Vdd rails within a cell
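
As a worked illustration, the short Python sketch below evaluates a rail width under the assumption that the relationship has the form width = (pitch · R_sh · N_bits · I_cell) / (IR_max · DR · N_rails), which is dimensionally consistent with the definitions above. The function name and all numeric inputs (sheet resistance, cell current, allowable drop) are hypothetical values chosen only for the example.

```python
def vdd_rail_width_um(pitch_um, rsh_ohm_per_sq, n_bits, i_cell_amps,
                      ir_max_volts, derating=4.0, n_rails=1):
    """Width of each Vdd rail in microns, under the assumed relationship."""
    numerator = pitch_um * rsh_ohm_per_sq * n_bits * i_cell_amps
    denominator = ir_max_volts * derating * n_rails
    return numerator / denominator


# Hypothetical 64-bit column with a 65 um effective cell pitch.
pitch_um = 64 * 65                      # 4160 um column height
width = vdd_rail_width_um(pitch_um, rsh_ohm_per_sq=0.04, n_bits=64,
                          i_cell_amps=0.0005, ir_max_volts=0.1)
print(f"{width:.3f} um")
```

Doubling the number of rails in a cell halves the width required of each one, which is the kind of trade-off such a script makes easy to explore.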

This script also generates an estimate for total power. Since the empirical
relationships do not consider the dynamic nature of power for buffer cells, this estimate is
expected to be somewhat inaccurate. A more realistic value for power would be obtained by
deriving an empirical relationship for dynamic buffer power as a function of interconnect,
fanout, and frequency. The delay-calculator traversal engine could be used to calculate the
interconnect and fanout on a per-instance basis.

5.10  Observations  About  Verification 

Verification consumes an ever-increasing share of design time as processors
become more complex. Both functionality and performance must be verified in a computer
design before it is fabricated. Timing analysis has been discussed as a means of improving
cycle time. However, a design that is functionally correct may incorporate errors that limit
performance by causing additional cycles to be executed. For instance, consider the head
and tail pointers used to index into a reorder buffer. If these pointers are not advanced
correctly, it is possible to operate in such a way that entries are not all allowed to be
simultaneously active. Instructions will execute correctly, but more issue-stalls will occur since
the reorder buffer will appear to be full more often than it should. Similarly, an inefficient
implementation of a state machine might require unnecessary state transitions and add extra
cycles of latency to an operation. Identifying such errors involves comparing cycle counts
for a high-level architectural model with those of the actual structural implementation. A
compact loop which increments a certain memory location some number of times can be
used to detect such errors.

Among other issues, functional verification requires a choice of simulation
environment. We have found that 60% to 70% of all bugs are identified using tests that are
hand-generated by the designer. Since these tests tend to be fairly compact, an environment
such as Verilog with a simulation speed of several cycles per second is appropriate. Random
testing should find 20% to 35% of errors. Since many millions of instructions will need to
be run for these tests, a compiled-code simulator such as VCS, which offers a simulation
speed on the order of 30 to 50 cycles per second, is recommended. Running verification in
parallel on 10 machines for 4 months would allow approximately 5 billion cycles (a few
billion instructions) to be run. Eventually, actual application and OS code needs to be
verified; for this, a hardware emulation system would be appropriate. This approach offers
simulation speeds between 500 hertz and a few kilohertz. While the use of compiled-code
simulation can be fairly transparent to the user, hardware emulation tends to involve more
manpower. At the present time, software tools that support hardware emulation are
somewhat immature and inefficient, and it is not uncommon for small design changes to
require turn-around times of several days.
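
The 5 billion cycle figure can be checked with simple arithmetic. The sketch below assumes that 4 months corresponds to roughly 120 days of uninterrupted simulation on every machine, which is an idealization; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_cycles(machines, days, cycles_per_second):
    """Simulated cycles completed if every machine runs continuously."""
    return machines * days * SECONDS_PER_DAY * cycles_per_second


low = total_cycles(machines=10, days=120, cycles_per_second=30)
high = total_cycles(machines=10, days=120, cycles_per_second=50)
print(f"{low / 1e9:.1f} to {high / 1e9:.1f} billion cycles")
```

Under these assumptions the range is about 3.1 to 5.2 billion cycles, so the quoted 5 billion corresponds to the upper end of the 30 to 50 cycles per second range.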

Functional verification of the FPU depended heavily on randomly generated tests
to exercise individual functional units, the full chip, and the IPU in the Aurora III system.
Functional units were verified through the use of randomly generated operations, operands,
and rounding modes. A specific computation was performed on both the compiled Verilog
machine model and on an actual workstation. Data results and exceptions were compared
and discrepancies were highlighted. Separate test modes were created for each of the
functional units to allow errors found during random testing to be quickly rerun on an
individual basis. In all, between 5 and 10 million operations were run successfully for all
functional units. Verification of a commercial chip would have to be more extensive, with
many billions of operations being simulated. The level of verification performed was
considered appropriate for the resources of an academic effort. Hardware emulation would
provide the best support for more extensive validation.

As mentioned, chip-level verification was done in a random test generation
environment. A behavioral model for the IPU was created to feed the FPU with instructions and
data. Tests are generated from 35 primitive instruction sequences. In general, there is one
primitive for each piece of functionality in the FPU, such as add, multiply, or divide
instructions. A typical sequence will execute a floating-point instruction and compare the result to
the known correct value, making the tests self-checking. The simulation environment loads
several queues at the start of a test with the correct outcome for floating-point comparisons
and stores. The primitives are randomly tiled together and a variable number of NOPs are
added. The program that results is run on the compiled-code version of the simulator.
Instead of requiring this FPU test environment to read an actual program binary, pseudo
instructions are used and the resulting program consists of ASCII hex values. This approach
to testing offers several advantages. First, because the tests are self-checking, there is no
need to compare results against a separate high-level behavioral model of the design. With
limited resources, it is difficult to support efforts to design separate trace-driven,
behavioral, and structural representations of a design. Second, each random test is fairly short, on
the order of several hundred instructions, making it easy to locate errors. By contrast, even
short programs typically involve several thousand instructions, which makes errors harder
to isolate. Finally, each test is self-contained, which allows verification to be parallelized
almost without limit. The only limitations are the number of available workstations on a
network and the number of licensed copies of the simulation environment. Starting from the
moment random verification commenced, over 200 million floating-point instructions were
run successfully.
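
The tiling of primitives can be illustrated with a small generator. This sketch is not the actual test environment: the primitive table, the pseudo-instruction names, the `HEADER` placeholder, and the choice to insert NOPs after every instruction are all assumptions made for the example.

```python
import random

# Hypothetical primitives: each is a short sequence of pseudo-instructions
# ending in a self-checking compare.
PRIMITIVES = {
    "add": ["ADD", "CMP"],
    "mul": ["MUL", "CMP"],
    "div": ["DIV", "CMP"],
}


def generate_test(primitives, num_primitives, max_nops, seed):
    """Tile randomly chosen primitives, padding with a variable number of NOPs."""
    rng = random.Random(seed)            # a seeded generator keeps a test reproducible
    program = ["HEADER"]                 # placeholder for a constant setup block
    for _ in range(num_primitives):
        name = rng.choice(sorted(primitives))
        for instr in primitives[name]:
            program.append(instr)
            program.extend(["NOP"] * rng.randint(0, max_nops))
    return program


prog = generate_test(PRIMITIVES, num_primitives=20, max_nops=3, seed=1)
print(len(prog), prog[:8])
```

Because each test depends only on its seed, a failing test can be regenerated on any workstation, which fits the self-contained, parallel style of verification described above.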

Several observations about random verification were made in our experience with
the FPU. For example, if the primitives used are not short enough, the full benefit of tiling
the primitives is not realized and instead the only benefit will be based on the number of
permutations that are derived from inserting NOPs. In our initial tests, each primitive
started with several load instructions to initialize the operands. This meant that in the tests,
each primitive was succeeded by a load instruction, rather than all possible permutations. A
constant header block was therefore added to each random test. Within this header, all
operands used for computations and all results used for self-checking verification were loaded
into the first 23 registers of the floating-point register file. The remaining 9 registers were
used for dynamic results. At first, a floating-point compare instruction always ended each
primitive and tiling only resulted in testing pairings of compare instructions with other
instructions. Again, not all permutations were being allowed. The solution made use of the
following sequence:

1.  Buffer  the  result  of  the  last  compare  in  a  primitive. 

2.  If  the  destination  register  of  the  first  instruction  of  the  next  primitive  is  the  source  of 
this  compare,  issue  the  compare  and  then  the  new  instruction,  otherwise  issue  the 
new  instruction  followed  by  the  compare.  To  reduce  the  occurrence  of  the  former 
condition,  the  source  operands  of  the  compare  are  always  registers  Fa  and  Fb,  and 
the  destination  of  the  initial  instruction  is  always  register  Fc. 
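
The two-step sequence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` record, the `"cmp"` opcode name, and the `tile` function are hypothetical and are not taken from the actual test generator, and "buffering" is interpreted here as holding back the trailing compare instruction until the first instruction of the next primitive is known.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                    # hypothetical opcode name, e.g. "add" or "cmp"
    dest: Optional[str]        # destination register, None if nothing is written
    srcs: Tuple[str, ...]      # source registers


def tile(primitives: List[List[Instr]]) -> List[Instr]:
    """Concatenate primitives while holding back each trailing compare."""
    out: List[Instr] = []
    pending: Optional[Instr] = None      # compare buffered from the previous primitive
    for prim in primitives:
        body = list(prim)
        if pending is not None and body:
            first = body.pop(0)
            if first.dest in pending.srcs:
                # The new instruction would overwrite a compare operand,
                # so the compare has to be issued first.
                out += [pending, first]
            else:
                # Safe case: the compare is now paired with a non-compare
                # instruction on both sides.
                out += [first, pending]
            pending = None
        if body and body[-1].op == "cmp":
            pending = body.pop()         # step 1: buffer the last compare
        out += body
    if pending is not None:
        out.append(pending)              # flush the final buffered compare
    return out
```

With compare sources fixed to Fa and Fb and first-instruction destinations fixed to Fc, the first branch is rarely taken, which is the intent of the register convention described in step 2.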

The  last  form  of  verification  was  performed  at  the  system  level,  where  the  IPU  and
FPU  are  connected.  The  primitives  used  in  this  simulation  exercise  the  same  functionality 
as  in  the  chip-level  tests,  but  now  consist  of  actual  MIPS  instructions.  This  environment 
serves  to  more  comprehensively  test  the  actual  interface  between  the  two  chips.  Since  the 
models  for  both  chips  are  fully  realized,  all  functionality  is  simulated,  including  cache 
misses.  As  a  result,  instruction  and  data  misses  lower  the  throughput  of  floating-point  in¬ 
structions. 

Random  verification  offers  an  efficient  means  of  resolving  the  majority  of  bugs  in 
a  design.  However,  it  fails  to  provide  complete  coverage.  This  is  due  to  several  reasons. 
First,  a  designer  will  still  tend  to  include  assumptions  in  the  infrastructure  for  random  test¬ 
ing  that  leave  certain  areas  untested.  For  example,  using  random  testing  for  the  verification 
of  a  functional  unit  needs  to  consider  different  regions  of  operation.  The  add  unit  is  com¬ 
prised  of  different  pieces  of  logic  that  are  each  exercised  depending  on  whether  an  align¬ 
ment  of  one,  no  alignment,  or  an  alignment  of  greater  than  one  is  necessary.  Failure  to  force 
operands  to  fall  equally  within  each  of  these  regimes  would  result  in  inadequate  coverage. 
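
One way to force operands into each alignment regime is to choose the exponent difference first and only then randomize the mantissas. The Python sketch below illustrates that idea; the function names, the exponent range, and the regime labels are assumptions made for this example, and it treats the alignment shift as the difference between the binary exponents of the two operands.

```python
import math
import random


def exponent(x: float) -> int:
    """Binary exponent e such that abs(x) = m * 2**e with 0.5 <= m < 1."""
    return math.frexp(x)[1]


def random_mantissa(rng: random.Random) -> float:
    """Exactly representable value in [0.5, 1), built from a 53-bit integer."""
    return rng.randrange(1 << 52, 1 << 53) / float(1 << 53)


def operand_pair(regime: str, rng: random.Random):
    """Return (a, b) whose exponents differ by 0, by 1, or by more than 1."""
    shifts = {"none": 0, "one": 1, "many": rng.randint(2, 60)}
    shift = shifts[regime]
    exp_a = rng.randint(-100, 100)       # arbitrary range, well inside double precision
    a = math.ldexp(random_mantissa(rng), exp_a)
    b = math.ldexp(random_mantissa(rng), exp_a - shift)
    return a, b


def balanced_operands(count: int, seed: int = 0):
    """Cycle through the three regimes so each receives an equal share of tests."""
    rng = random.Random(seed)
    regimes = ("none", "one", "many")
    return [(regimes[i % 3],) + operand_pair(regimes[i % 3], rng)
            for i in range(count)]
```

Purely uniform random operands would land almost entirely in the "many" regime, which is why the regime is selected explicitly here.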

Random  testing  makes  it  difficult  to  discover  unusual  boundary  cases  that  are  dependent
on  the  occurrence  of  a  sequence  of  events,  such  as  an  instruction  miss  which  occurs
for  a  branch,  followed  by  a  page-fault  in  the  delay  slot,  and  while  this  fault  is  being  serviced,
an  interrupt.  A  logic  error  might  result  in  the  wrong  program-counter  being  selected
by  the  time  the  delay-slot  instruction  is  finally  executed.  Such  a  sequence  of  events  can  be
quite  difficult  to  generate,  and  even  if  several  billion  instructions  are  executed,  the  error 
may  never  be  encountered.  As  designs  become  more  complex,  interest  has  developed  in 
pursuing  more  formal  approaches  to  verification. 

5.11  Future  Work:  A  Methodology  for  Automatic
Logic  Optimization

This  section  will  summarize  several  ideas  for  an  automated  approach  to  performing 
timing  analysis  and  logic  optimization;  many  of  the  individual  steps  in  this  methodology 
have  been  mentioned  in  the  preceding  sections.  During  the  course  of  completing  the  timing
phase  of  the  Aurora  III  IPU  and  FPU  designs,  it  became  clear  that  much  effort  was  being
spent  on  the  same  set  of  tasks,  each  of  which  is  well  defined  and  reasonably  straightforward
to  implement  automatically.  Part  of  the  challenge  in  implementing  an  automated  version
of  this  methodology  centers  on  how  to  iterate  between  these  steps.  A  summary  of  the
methodology  is:

1.  Levelization  is  performed  in  order  to  identify  paths  which  exceed  the  targeted  gate 
depth.  Logic  can  be  optimized  in  the  following  3  ways:

a.  Optimal  synthesis  for  logic  which  has  fewer  than  10  inputs/outputs,  perhaps  using 
some  sort  of  exhaustive  branch-and-bound  algorithm.  The  majority  of  logic  within 
a  design  is  in  blocks  having  few  inputs  and  producing  few  outputs  (refer  to 
Table  5.2).  Current  synthesis  tools  often  produce  results  which  are  far  from  opti¬ 
mal;  this  problem  is  made  worse  by  the  limited  library  of  gates  available  in  GaAs. 
Table  5.3  compares  several  logic  blocks  that  were  synthesized  using  GaAs  and 
CMOS  cell  libraries,  where  the  latter  has  access  to  more  complex  gate  structures. 
The  GaAs  implementation  often  has  many  more  levels  than  the  CMOS  version, 
and  even  for  CMOS,  the  result  is  often  several  levels  away  from  being  optimal.  The 
poor  results  for  GaAs  appear  to  be  due  to  the  technology  mapping  phase  of  the  syn¬ 
thesis  process.  Similar  results  have  been  found  using  MIS,  another  synthesis  tool. 
The  need  to  redo  large  amounts  of  logic  by  hand  is  time  consuming,  and  the  dis¬ 
covery  of  a  bug  in  the  definition  of  the  logic  may  mean  that  the  same  logic  must  be 
regenerated  several  times.  Whereas  the  original  behavioral  description  for  the
logic  may  be  easy  to  understand,  the  subsequent  structural  implementation  may  be 
quite  difficult  to  follow. 

b.  Factor  out  late-arriving  signals  through  the  use  of  a  mux-reduction  technique.  Figure  5.14
demonstrates  this  idea.  Signal  A,  which  might  be  the  carry-out  of  an  adder,
does  not  arrive  until  late  in  the  current  phase.  However,  the  original  synthesized 
logic  is  not  aware  of  this  fact  and  signal  A  is  factored  into  the  first  level  of  the  logic 
block.  To  solve  this  problem,  two  separate  descriptions  for  the  logic  are  used;  one 
assumes  input  A  is  a  zero  and  the  other  assumes  input  A  is  a  one.  The  output  from

Table  5.2  Average  Inputs  per  Output  for  Control  Logic

Control Block            Avg Inputs per Output
elogic                    3.600
Iqtagmemcontrol           4.000
sbcontrol                 2.000
fpucontrol               15.840
precisetagmemcontrol      4.286
predecodelogicfpnew       8.840
stickybit
grsnew                    7.857

Table  5.3  Gate-Depth  for  CMOS  versus  GaAs  Logic  Synthesis

Control Block           No. Gates      No.        Avg Gate-Depth    Max Gate-Depth    Avg Difference
                        (CMOS/GaAs)    Outputs    (CMOS/GaAs)       (CMOS/GaAs)       in Gate-Depth
convertcontrol          68/102          21        3.476/3.952       6/8               0.476
iqcontrol               77/140          25        7.400/8.480       11/13             1.080
robcontrol              217/384        117        3.282/3.615       8/8               0.333
elogic                  20/50           15        3.333/5.067       4/6               1.733
Iqtagmemcontrol         41/77           24        3.000/4.250       5/6               1.250
sbcontrol               444/568        288        1.889/1.889       2/2               0.000
fpucontrol              757/1312       169        6.083/6.503       15/15             0.420
precisetagmemcontrol    78/128          42        3.000/4.429       6/8               1.429
predecodelogicfpnew     236/395         25        6.640/8.080       13/12             1.440
stickybit               133/231         51        5.529/5.882       10/10             0.353
grsnew                  150/269         35        6.429/8.286       9/12              1.857

each  description  is  fed  into  a  multiplexor,  whose  select  line  is  simply  the  late-arriving
signal.  In  this  way,  only  3  gate  levels  (1  for  decoding,  2  for  the  multiplexor)  are
added  to  the  path-depth  of  the  logic  that  generates  signal  A.  Synthesis  is  still  used 
to  generate  the  A=0  and  A=1  signals,  and  because  much  of  the  logic  is  shared 
between  these  signals  there  is  only  a  slight  increase  in  instance  count  compared  to 
the  original  version.  Doing  this  transformation  by  hand  is  tedious  and  makes  the 
resulting  description  more  difficult  to  interpret. 

c.  Utilize  a  post-processing  optimization  phase  in  order  to  pattern  match  commonly 
occurring  logic  sequences.  Propagation  of  constants,  redundant  logic  at  the  inter¬ 
faces  between  logic  blocks,  and  poor  synthesis  can  all  be  addressed  in  this  way. 


2.  Repartition  logic  across  latch  boundaries.  This  will  involve  moving  small  pieces  of 
logic  to  either  the  previous  or  next  clock  phase.  This  step  might  operate  directly  on 
the  design  database  or  generate  appropriate  Verilog  code  that  can  be  integrated  into 
the  top  level  design.  In  either  case,  the  user  should  be  made  aware  of  the  changes 
being  made. 

3.  Perform  the  closely-coupled  steps  of  parasitic  extraction,  delay  calculation,  static 
timing  analysis,  and  buffer  sizing/selection.  This  stage  analyzes  the  delay  of  critical 
paths  and  iterates  primarily  by  choosing  to  size  an  existing  driver  or  to  select  a  buff¬ 
ered  gate  type  for  those  instances  that  experience  a  long  delay  due  to  fanout  or 
interconnect  parasitics.  Wires  along  critical  paths  which  have  large  resistances
should  be  resized  automatically,  and  this  RC  delay  information  should  be  included
in  path  delays.

Figure  5.14  Factoring  Late  Arriving  Signals  Via  Mux-Reduction
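
The mux-reduction of step 1b is an instance of Shannon expansion on the late-arriving input. The Python sketch below illustrates it on a made-up four-input block; the function `original` and the signal names are hypothetical and do not correspond to any logic in the FPU. It builds the A=0 and A=1 descriptions, selects between them with A, and checks exhaustively that the result matches the original function.

```python
from itertools import product
from typing import Callable, Dict, Sequence

Bits = Dict[str, int]
Logic = Callable[[Bits], int]


def original(v: Bits) -> int:
    """Hypothetical block in which late signal A enters at the first logic level."""
    t = (v["A"] & v["B"]) | v["C"]
    return (t ^ v["D"]) & (v["B"] | v["D"])


def cofactor(f: Logic, name: str, value: int) -> Logic:
    """Description of f with input `name` tied to a constant 0 or 1."""
    return lambda v: f({**v, name: value})


def mux_reduced(f: Logic, late: str) -> Logic:
    """Select between the two cofactors using the late-arriving signal."""
    f0 = cofactor(f, late, 0)            # evaluated without waiting for the late input
    f1 = cofactor(f, late, 1)
    return lambda v: f1(v) if v[late] else f0(v)


def equivalent(f: Logic, g: Logic, names: Sequence[str]) -> bool:
    """Exhaustively compare two functions over all input combinations."""
    return all(f(dict(zip(names, bits))) == g(dict(zip(names, bits)))
               for bits in product((0, 1), repeat=len(names)))
```

In hardware, the two cofactors are computed while A is still settling, so only the final selection lies on the path from A to the output. The equivalence check is the kind of safeguard an automated version of this transformation would want after each rewrite.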

Automation  of  this  sequence  of  steps  will  save  much  time  on  tasks  which  are  te¬ 
dious  to  do  by  hand.  Since  verification  and  performance  optimization  often  occur  in  paral¬ 
lel,  more  time  can  be  lost  if  bugs  require  repetition  of  some  of  the  steps.  Further,  doing  these 
steps  by  hand  makes  the  high-level  Verilog  description  of  a  design  more  difficult  to  inter- 


CHAPTER  6 
Conclusion 


This  chapter  summarizes  the  work  and  contributions  presented  in  this  thesis  and 
discusses  several  related  future  areas  of  research. 

6.1  Summary  of  GaAs  Technology 

Many  of  the  design  constraints  for  GaAs  DCFL,  including  small  noise  margins,  low 
threshold  voltages,  sensitivity  to  voltage  drops  along  ground  distribution,  and  over-driving 
of  inputs  along  highly  capacitive  nets,  are  a  result  of  the  Schottky  diode  gate  of  MESFET 
transistors.  We  addressed  the  over-driving  problem  with  the  use  of  a  feedback  buffer, 
which  provides  a  large  dynamic  current  for  charging  a  wire  and  a  smaller  static  current  for 
dc  operation.  The  benefit  of  fast  gate  switching  speeds  in  DCFL  tends  to  be  offset  by  great¬ 
er  gate-depth  that  results  from  a  NOR-NOR  logic  topology.  Transistor  source  resistance  is 
large  enough  to  limit  the  use  of  both  stacked-transistor  gates  and  pass  gates,  especially  in 
light  of  small  noise  margins.  Leakage  currents  are  several  orders  of  magnitude  larger  than 
for  silicon  devices  and  constrain  the  maximum  fanin  of  a  DCFL  gate  to  four  inputs.  In  ad¬ 
dition,  this  issue  affects  the  density  of  SRAM  components  by  limiting  the  amortization  of 
sense  amplifiers  across  rows  of  an  array.  The  use  of  dynamic  logic  is  impractical  due  to  the 
current  that  flows  into  the  gate  of  a  MESFET  transistor.  Gate  current  also  limits  the  degree 
of  fanout  that  can  be  supported  by  either  DCFL  or  buffered  gates.  In  current  GaAs  process¬ 
es,  interconnect  has  larger  dimensions  than  in  current  CMOS  processes.  The  importance  of 
improving  metal  technology  in  GaAs  was  illustrated  by  contrasting  the  expected  improve¬ 
ment  to  that  of  improving  gate  delays. 

The  current  performance  gains  of  GaAs  DCFL  process  technology  are  not  significant
enough  to  compel  a  current  CMOS  design  effort  to  be  retargeted  for  GaAs.  Integration
levels  are  lower  than  CMOS  by  a  factor  of  5  to  10.  Much  of  this  is  due  to  the  characteristics
of  ratioed  logic,  large  leakage  currents,  and  less  efficient  interconnect  pitches,  rather  than
to  a  difference  in  minimum  transistor  feature  size.  In  terms  of  operating  frequency,  GaAs
does  offer  faster  loaded  gate  speeds,  but  this  benefit  is  tempered  by  DCFL  providing  a 
smaller  range  of  circuit  design  options  and  by  needing  longer  gate-depths  to  produce  the 
same  functionality.  Power  dissipation  at  high  frequencies  has  often  been  claimed  as  an  ad¬ 
vantage  of  GaAs.  The  static  relationship  of  power  dissipation  in  GaAs  is  contrasted  with 
the  strong  dynamic  component  in  CMOS  and  suggests  a  cross-over  point  for  frequency  at 
which  GaAs  becomes  more  power  efficient.  However,  several  trends  in  CMOS  design  have 
weakened  this  argument.  Power  supply  voltages  for  CMOS  have  dropped  from  5  volts  to 
3.3  volts  and  lower,  which  has  raised  the  cross-over  point.  More  importantly,  there  has  been 
an  effort  to  reduce  power  by  turning  off  the  clock  to  logic  blocks  which  are  not  being  ac¬ 
tively  utilized.  Consequently,  logic  transitions  and  resulting  dynamic  power  dissipation  are 
reduced.  This  approach  is  not  nearly  as  effective  for  GaAs,  in  light  of  the  strong  static  com¬ 
ponent  to  power  dissipation.  Complementary  GaAs  (C-GaAs)  offers  the  potential  of  ad¬ 
dressing  some  of  these  issues,  but  with  the  caveat  of  being  unproven  at  the  present  time.  C- 
GaAs  will  support  limited  stacked  transistor  structures,  which  should  reduce  gate-depth 
along  critical  paths.  Power  dissipation  for  C-GaAs  circuits  should  be  lower  since  there  is 
no  dc  path  between  power  and  ground.  Current  will  still  flow  into  the  gate  of  a  transistor 
and  contribute  a  static  component  to  power,  but  this  is  substantially  less  than  for  DCFL. 

It  seems  unlikely  that  in  the  short  term  GaAs  DCFL  will  become  a  significant  alter¬ 
native  to  CMOS  for  large  VLSI  designs.  Consider  Table  6.1,  which  shows  a  projection  for 
DCFL  and  C-GaAs.  In  addition  to  a  process  improvement  of  30%  for  loaded  gate  delay,  the 
DCFL  projection  assumes  that  industrial-level  resources  are  applied  to  the  project;  these  re¬ 
sources  would  result  in  both  several  gate-levels  being  removed  from  all  critical  paths  and 
an  increase  in  density  that  translates  into  a  20%  gain  in  clock  frequency.  For  C-GaAs,  a 
more  deeply  pipelined  machine  might  achieve  a  logic-depth  along  critical  paths  of  15  gates 
per  clock  cycle.  However,  a  more  deeply  pipelined  design  can  suffer  worse  memory  system

Table  6.1  Performance  Projections  for  GaAs  and  C-GaAs

Technology                      Avg Loaded Gate    Gate-depth per    Clock Frequency
                                Delay (ps)         Clock Cycle       (MHz)
DCFL (1994)                     100                40                250
DCFL (1995)                      80                40                310
DCFL (1997)                      50                36                500
C-GaAs (1997) Conservative      100                22                450
C-GaAs (1997) Optimistic #1      70                22                650
C-GaAs (1997) Optimistic #2      70                15                952


latency,  will  require  more  latches,  which  may  constrain  clock  frequency,  and  will  be  more
sensitive  to  mispredicted  branches.  C-GaAs  may  provide  a  level  of  density  for  SRAM
components  that  is  appropriate  to  support  a  combination  of  on-chip  caches,  register  blocks,
and  branch-prediction  techniques  used  to  reduce  mispredicted  branches.  A  recent  1µm
C-GaAs  4K-bit  SRAM  provides  a  density  of  11,600  transistors  per  mm²,  an  access  time  of
5.3ns,  and  a  power  dissipation  of  16.2mW  [Hallmark94].  In  general,  a  target  logic-depth  of
15  gates  per  cycle  will  be  a  challenging  goal  for  either  an  academic  or  industrial  design  ef¬ 
fort.  In  comparison,  CMOS  processors  have  currently  been  demonstrated  to  operate  at 
clock  frequencies  in  the  range  of  200MHz  to  300MHz;  at  the  traditional  growth  rate  of  50%
per  year,  it  is  reasonable  to  expect  the  introduction  within  the  next  two  years  of  CMOS  pro¬ 
cessors  with  clock  frequencies  greater  than  600MHz.  The  basic  limitations  just  discussed 
for  GaAs  may  not  be  due  to  technical  reasons  as  much  as  to  economic  ones.  In  the  absence 
of  volume  commercial  products  fabricated  in  GaAs,  there  might  not  be  sufficient  financial 
resources  to  advance  the  technology  at  a  pace  comparable  to  that  of  CMOS.  An  alternative 
technology  probably  needs  to  offer  at  least  twice  the  performance  in  order  to  attract  design¬ 
ers  away  from  the  large  infrastructure  and  expertise  currently  available  in  silicon  technol¬ 
ogy. 
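
Most entries of Table 6.1 are close to what a simple model gives if the cycle time is taken as the gate-depth per cycle multiplied by the average loaded gate delay. The Python sketch below works through that arithmetic together with the 50% per year CMOS growth estimate. The model and the function names are assumptions made for illustration; the DCFL (1997) row, for instance, lists 500MHz where this product gives roughly 556MHz, so the table presumably reflects additional factors.

```python
def clock_mhz(gate_delay_ps: float, gate_depth: int) -> float:
    """Clock frequency in MHz when one cycle spans gate_depth loaded gate delays."""
    cycle_ps = gate_delay_ps * gate_depth
    return 1e6 / cycle_ps                # 1 / (1 ps) = 1e12 Hz = 1e6 MHz


def projected_mhz(freq_mhz: float, growth_per_year: float, years: int) -> float:
    """Compound a yearly growth rate, e.g. 0.5 for 50% per year."""
    return freq_mhz * (1.0 + growth_per_year) ** years


if __name__ == "__main__":
    rows = [("DCFL (1994)", 100, 40), ("C-GaAs (1997) Optimistic #2", 70, 15)]
    for name, delay_ps, depth in rows:
        print(f"{name}: {clock_mhz(delay_ps, depth):.0f} MHz")
    for start in (200.0, 300.0):
        print(f"CMOS from {start:.0f} MHz: {projected_mhz(start, 0.5, 2):.0f} MHz")
```

Under the same growth assumption, only the upper end of the 200MHz to 300MHz CMOS range exceeds 600MHz after two years.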




6.2  Summary  of  FPU  Architectural  Issues

The  architectural  study  is  based  on  a  trace-driven  simulator  which  was  extended  to 
describe  the  Aurora  III  integer  and  floating-point  architecture.  The  simulator  produces  performance
metrics,  including  average  instruction  latencies,  dynamic  instruction  frequencies,
basic  block  size,  bus  utilization,  average  degree  of  issue,  sizes  for  different  resources,  stall 
sources,  and  cycles-per-instruction.  Simulations  were  run  using  traces  from  the  SPECfp92 
benchmarks  for  a  wide  variety  of  architectural  features. 

The  first  of  these  experiments  examined  three  issue  policies  for  floating-point  instructions,
IOIO,  IOOO,  and  OOOO  (the  first  and  second  pairs  of  characters  mean  in-order  or
out-of-order  for  issue  and  completion  of  instructions).  A  number  of  conclusions  were
drawn  about  the  resource  cost  and  performance  benefit  of  each.  The  first  policy  is  the  sim¬ 
plest  and  achieves  the  worst  CPI,  although  it  does  support  a  faster  clock  frequency  by  elim¬ 
inating  the  need  for  a  reorder  buffer  and  by  simplifying  issue  logic.  The  second  policy, 
which  utilizes  more  parallelism  by  allowing  results  to  complete  out  of  order,  offers  approx¬ 
imately  12%  better  SPECfp92  performance  for  a  moderate  increase  in  resources  and  design 
complexity.  Dual-transfer  and  dual-issue  of  floating-point  instructions  were  found  to  offer 
a  combined  15%  improvement  in  CPI  versus  a  single  issue  approach.  The  third  policy, 
OOOO,  attempts  to  increase  look-ahead  capability  by  moving  a  data-dependent  instruction 
past  the  decode  phase,  into  an  instruction  window  which  resides  between  the  decode  and 
execute  units.  Ideally,  this  should  allow  a  greater  opportunity  to  find  instructions  without 
dependencies.  However,  the  realized  performance  gains  are  small  (a  few  percent),  due  pri¬ 
marily  to  an  increased  utilization  of  the  reorder  buffer.  While  most  instruction  types  pro¬ 
duce  a  result  which  can  be  used  immediately  upon  writing  the  reorder  buffer,  floating-point 
comparison  and  store  instructions  must  wait  until  the  corresponding  entry  reaches  the  head 
of  the  reorder  buffer  in  order  to  ensure  precise  handling  of  memory  exceptions.  Most  of 
these  synchronization  stalls  are  due  to  branch-on-compare  sequences.  An  OOOO  policy  results
in  more  instructions  being  active,  and  a  corresponding  increase  in  the  number  of  reorder
buffer  entries  that  precede  a  floating-point  compare,  which  in  turn  reduces  the  intended
benefit  of  this  issue  policy.  Further,  the  resources  required  by  an  OOOO  policy  are  quite
substantial;  two  reservation  station  entries  per  functional  unit  can  add  25%  to  overall  chip
area.  In  other  words,  the  additional  resources  needed  to  implement  OOOO  are  equivalent
to  the  area  difference  between  a  2-cycle  pipelined  multiply  unit  and  a  5-cycle  iterative  unit.
The  performance  difference  for  this  multiply  unit  trade-off  is  10%,  which  is  an  improvement
not  matched  by  switching  from  IOIO  to  OOOO.  Several  additional  aspects  of  an  out-of-order
issue  policy  are  discussed,  including  design  complexity  and  approaches  for  selecting
instructions  to  be  issued  from  a  reservation  station.  All  initial  simulation  experiments
were  run  with  large  resources  in  order  to  identify  an  upper  bound  for  performance.  Infor¬ 
mation  from  these  runs  was  used  to  choose  a  more  reasonable  allocation  of  resources  which 
results  in  only  a  small  degradation  in  performance  (a  few  percent)  from  the  ideal  case  of 
unlimited  resources. 

Although  the  FPU  with  the  chosen  issue  policy  (IOOO)  and  configuration  generates
few  stall  cycles  due  to  the  sizes  of  various  resources,  the  other  major  stall  source,  branch- 
on-FPU  comparisons,  was  found  to  cause  a  significant  number  of  stalls  on  some  bench¬ 
marks.  A  set  of  instruction  sequences,  consisting  of  a  floating-point  load  followed  by  a 
compare  and  then  a  branch  which  depends  on  the  result  of  the  compare,  was  found  to  be 
common  among  these  programs.  A  number  of  approaches  were  discussed  for  reducing  the 
large  latency  associated  with  the  compare  instruction,  including:  moving  ahead  in  time  the 
transfer  point  for  floating-point  instructions,  improving  the  primary  memory  system,  and 
reordering  code  for  these  sequences  in  order  to  place  more  useful  unrelated  work  between 
the  load,  compare,  and  branch  instructions.  All  of  these  issues  were  investigated  in  the  con¬ 
text  of  both  single-  and  dual-issue  designs.  The  best  design  point  that  resulted  would  reduce 
compare  latency  to  simply  the  time  needed  to  perform  the  comparison  and  transmit  the  re¬ 
sult  back  to  the  IPU;  the  overall  performance  gain  was  approximately  10%,  with  some  in¬ 
dividual  programs  improving  by  as  much  as  23%. 

Integration  levels  are  low  in  GaAs  and  several  techniques  for  improving  memory 
system  performance  were  discussed,  including  both  greater  bandwidth  through  the  support 
of  double-word  load  and  store  instructions,  and  data  prefetching.  An  optimistic  upper
bound  for  performance  improvement  of  the  former  was  found  to  be  10%,  whereas  the  latter
results  in  an  overall  reduction  in  CPI  of  15%  and  improvement  for  individual  benchmarks
of  as  much  as  60%. 

A  number  of  resource  allocation  issues  were  examined.  First,  performance  was 
compared  to  resource  requirements  for  functional  units  of  various  latencies.  For  example, 
a  2-cycle  add  unit  offers  only  a  2%  improvement  in  CPI  versus  a  3-cycle  design,  but  at  the 
expense  of  a  20%  increase  in  area.  However,  some  trade-offs  are  not  as  clear-cut,  such  as 
the  multiply  unit  decision  mentioned  above.  A  2-cycle  pipelined  multiply  unit  occupies 
twice  the  area  of  a  5-cycle  iterative  version,  but  achieves  a  10%  reduction  in  CPI.  Ultimate¬ 
ly,  integration  constraints  prompted  the  selection  of  the  slower,  smaller  multiply  unit.  A  dif¬ 
ferent  solution  might  be  reached  in  a  CMOS  design.  Second,  the  allocation  of  transistors 
was  examined  in  the  context  of  queue  and  reorder  buffer  entries.  Queue  entries  are  fairly 
inexpensive,  while  reorder  buffer  entries  are  more  costly  since  they  are  wider  and  require 
more  read  and  write  ports.  Two  to  three  entries  are  required  for  the  load  and  store  queues, 
which  corresponds  to  the  observation  that  most  applications  use  the  double  precision  for¬ 
mat,  requiring  two  loads  or  stores  per  operand.  Finally,  the  area  implications  for  each  of  the 
three  issue  policies  were  discussed.  As  mentioned,  in-order  issue  and  completion  is  the  sim¬ 
plest,  while  out-of-order  issue  and  completion  requires  substantially  more  storage  space 
than  the  other  policies. 

Instruction  and  data  queues  allow  the  integer  unit  to  slip  ahead  of  the  FPU,  since  the 
IPU  does  not  necessarily  need  to  stall  as  a  result  of  floating-point  data  dependencies  or  re¬ 
source  conflicts.  Decoupling  queues  also  serve  to  hide  latency  caused  by  chip  crossings  and 
can  lessen  the  impact  of  having  different  clock  frequencies  for  the  IPU  and  FPU,  a  situation 
that  can  result  from  differences  in  the  width  of  datapaths.  However,  the  use  of  queues  does 
make  support  of  precise  exceptions  more  difficult.  Characteristics  that  are  unique  to  both 
memory  and  floating-point  computational  exceptions  were  discussed  and  several  approach¬ 
es  are  presented  for  handling  these  issues  which  do  not  have  a  significant  impact  on  either 
chip  area  or  performance. 

In  order  to  reduce  pin  count,  the  existing  primary  data  cache  busses  were  also  used
for  transferring  instructions  and  data  between  the  IPU  and  FPU.  Up  to  two  floating-point
instructions  can  be  transferred  per  cycle,  which  is  the  bandwidth  needed  to  support  dual  is¬ 
sue  of  instructions  in  the  FPU.  Load  data  can  originate  from  several  sources,  including  the 
primary  data  cache,  the  write  cache,  the  prefetch  unit,  or  the  secondary  memory  system.
Floating-point  store  instructions  are  somewhat  more  complex  than  their  integer  counter¬ 
parts,  primarily  because  the  data  to  be  stored  arrives  sometime  after  the  corresponding  in¬ 
struction.  Several  alternatives  were  proposed  for  handling  both  loads  and  stores. 
Altogether,  these  cases  have  been  implemented  in  an  efficient  manner  which  has  little  impact
on  performance  and  which  requires  only  a  small  number  of  additional  I/O  pins. 

Achieving  high  performance  under  all  conditions  in  a  computer  having  integer  and 
floating-point  capability  requires  attention  to  a  number  of  details.  An  example  of  this  cov¬ 
ered  in  the  dissertation  is  the  use  of  result  busses.  As  more  parallelism  of  instruction  exe¬ 
cution  occurs  in  the  FPU,  more  demand  is  placed  on  the  bus  that  is  used  to  write  results 
from  an  execution  unit  to  the  reorder  buffer.  The  choice  of  using  two  result  busses  corre¬ 
lates  with  the  average  degree  of  issue  of  1.3  instructions  per  cycle.  The  benefit  of  providing 
hardware  support  for  square  root  is  also  analyzed.  This  operation  is  used  mostly  by  multi- 
media  applications,  in  which  it  is  used  to  translate  graphic-based  objects.  A  dedicated 
square  root  instruction  can  improve  the  performance  of  the  one  SPECfp92  benchmark  that 
relies  on  this  operation  by  50%,  and  overall  across  all  benchmarks  by  7%  to  9%.  Another 
trade-off  which  was  analyzed  was  floating-point  division,  which  can  be  implemented  by  re¬ 
ciprocal  algorithms  which  use  the  multiply  unit.  However,  the  impact  on  performance  in 
doing  so  is  quite  large,  since  frequently  occurring  multiply  instructions  cannot  issue  if  a  division
operation  is  outstanding.  The  effect  of  mispredicted  branches  on  floating-point
code  was  also  considered.  Because  the  basic  block  size  for  floating-point  code  is  larger  than 
that  of  integer  code,  there  is  less  opportunity  for  branch  prediction  to  improve  performance. 
The  static  prediction  policy  used  by  the  Aurora  III  architecture  degrades  CPI  on  floating-point
code  by  only  about  4%,  compared  to  a  perfect  prediction  policy;  integer  code  can  suffer
a  much  larger  penalty,  on  the  order  of  a  30%  degradation  in  CPI.  Finally,  simulation
accuracy  and  run-time  speed  for  trace-driven  simulation  are  examined.  Several  benchmarks
are  run  for  both  the  first  50  million  and  the  first  billion  instructions  and  several  metrics  are
used  to  compare  the  difference  in  results.  While  most  benchmarks  experience  only  a  few
percent  change  in  CPI,  one  program  (spice2g6)  does  see  a  15%  difference.  Consequently,
sampling  should  be  used  for  reasons  of  both  accuracy  and  simulation  speed  (but  was  not
used  in  the  Aurora  III  simulator  due  to  time  constraints  for  implementation  and  verification).
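
To illustrate why a reciprocal-based divide keeps the multiply unit busy, the Python sketch below uses Newton-Raphson iteration, one common reciprocal algorithm. The text does not state which reciprocal algorithm was considered, so the choice of Newton-Raphson, the linear initial estimate, and the iteration count are all assumptions made for this example. Each iteration needs two dependent multiplies, which in hardware would occupy the multiplier for the duration of the divide.

```python
import math


def reciprocal_divide(a: float, b: float, iterations: int = 5) -> float:
    """Approximate a / b using multiplies and subtracts after an initial estimate."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if b < 0 else 1.0
    m, e = math.frexp(abs(b))            # abs(b) = m * 2**e with 0.5 <= m < 1
    # Linear initial estimate of 1/m on [0.5, 1); the constants would be
    # precomputed in a hardware implementation.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # Newton-Raphson step, two multiplies
    return sign * math.ldexp(a * x, -e)  # undo the exponent scaling of b
```

Because the relative error roughly squares on every step, a handful of iterations reaches double-precision accuracy. The result is not guaranteed to be correctly rounded, which a real FPU would have to handle separately.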

The  architectural  investigation  was  concluded  by  summarizing  the  current  state  of 
microprocessor  performance,  through  the  use  of  SPEC  ratings.  Five  versions  of  the  Aurora
III  FPU  design  were  evaluated,  including  a  250MHz  baseline.  One  of  these  designs  emphasized
the  extraction  of  instruction-level  parallelism  through  greater  complexity  while  another
focused  on  a  simpler  design  which  runs  at  a  faster  clock  frequency.  The  other  versions
were  projections  of  the  baseline  in  light  of  expected  technology  improvements.  The  process 
technology  that  supports  the  FPU  design  has  not  changed  in  over  two  years.  However,  a 
newer  version  will  soon  be  available  and  should  decrease  the  average  loaded  gate  delay  by 
10%  to  40%.  The  final  Aurora  III  design  achieves  a  SPEC  rating  of  300,  which  compares
favorably  to  the  highest  performance  processors  currently  available  (SGI  TFP  =  310,
IBM/Motorola  PowerPC  620  =  300,  IBM  Power2  =  274).  Overall,  improvements  in  clock  frequency
have  a  greater  impact  on  improving  floating-point  performance  than  do  architectural
or  algorithmic  improvements.  A  number  of  variables  which  are  not  reflected  in  these
SPEC  ratings  are  also  mentioned,  including  better  code  reordering,  utilization  of  a  larger 
register  file,  support  for  double-word  loads  and  stores,  and  support  for  a  hardware  square 
root  instruction.  Together,  these  features  might  conservatively  improve  performance  by  an 
additional  20%  to  30%,  resulting  in  a  SPEC  rating  in  excess  of  400. 

While  many  of  the  architectural  trade-offs  presented  are  certain  to  be  familiar  inter¬ 
nally  to  industry  development  efforts,  this  dissertation  represents  a  unique,  comprehensive, 
and  accessible  summary  of  important  issues  for  supporting  high-performance  floating-point
execution.

6.3  Floating-Point  Implementation  Issues

The  results  of  simulation  studies  were  applied  to  the  design  of  an  FPU  for  the  Aurora
III  system  in  a  GaAs  DCFL  technology.  Adders  of  various  sizes  are  basic  components
used  in  all  of  the  functional  units,  so  several  alternatives  were  evaluated.  A  carry-skip  design
was  not  chosen  since  the  small  fanin  that  is  a  characteristic  of  GaAs  increases  the  number
of  gate  levels  needed  for  the  skip  logic.  Similarly,  a  group-4  carry  look-ahead  adder
requires  several  additional  gate-levels  compared  to  the  Ling-modified  carry-select  ap¬ 
proach  that  was  selected.  Much  of  the  motivation  for  the  functional  unit  designs  originated 
with  work  done  elsewhere,  but  has  been  extended  in  order  to  accommodate  differences  that 
result  from  using  GaAs.  Also,  a  number  of  corrections  to  the  original  references  were  dis¬ 
covered  during  the  verification  of  these  units.  More  than  5  million  random  test  vectors  were 
performed  for  all  computational  units.  New  implementations  for  leading-one  prediction 
and  rounding  logic  in  the  add  unit  were  presented.  The  conversion  unit  that  is  described  is 
an  original  design  that  can  generate  any  of  6  conversion  operations  with  a  latency  of  2  cy¬ 
cles.  The  dissertation  also  examines  a  general  set  of  implementation  issues,  including  ap¬ 
proaches  for  supporting  precise  exceptions  without  incurring  a  performance  penalty,  ways 
of  handling  floating-point  loads  and  stores,  the  use  of  predecoding  to  reduce  critical  path 
depth,  and  design-for-test  features. 
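As an illustration of the carry-select principle mentioned above, the following Python sketch models a plain carry-select adder at the bit level. It is not the Ling-modified design used in the FPU; the 16-bit width, the 4-bit block size, and the function names are assumptions chosen only for this example.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """Add two little-endian bit lists; return (sum_bits, carry_out)."""
    out = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_select_add(a, b, width=16, block=4):
    """Carry-select addition: every block computes both carry-in cases,
    and the incoming carry only selects between the two finished results."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result = 0
    carry = 0
    for lo in range(0, width, block):
        blk_a = a_bits[lo:lo + block]
        blk_b = b_bits[lo:lo + block]
        sum0, cout0 = ripple_add(blk_a, blk_b, 0)  # speculative result for Cin = 0
        sum1, cout1 = ripple_add(blk_a, blk_b, 1)  # speculative result for Cin = 1
        chosen, carry = (sum1, cout1) if carry else (sum0, cout0)
        for i, bit in enumerate(chosen):
            result |= bit << (lo + i)
    return result, carry
```

In hardware the two speculative sums of every block are formed in parallel, so the carry path consists only of the block-to-block selection.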

6.4  CAD  Support  for  High  Performance  VLSI 
Designs 

The dissertation discusses a number of analysis tools that have been developed to
provide feedback about timing and functionality. Many of these utilities are based on a
delay calculation network traversal routine obtained from Cascade Design Automation.
Initially, this delay calculator was updated to utilize a macro-model approach developed by
another student in our research group. Several utilities were developed to aid verification
of this new delay calculator, the predictions of which have been found to be within 10% of
HSPICE-generated delays for all cases and within 7% for more than 60% of cases. The
derivation of delays also depends on accurate parasitic extraction; several utilities were written
to address accuracy limitations in the physical design system that we used. These include
the capability to incorporate capacitances obtained from commercial extraction tools into
the delay calculation routines and the ability to analyze and adjust interconnect resistance
through wire sizing and specification of which layers are to be used during routing.
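The accuracy figures quoted above are relative errors against a reference simulator. A minimal sketch of such a comparison is given below; the function names, the thresholds passed as defaults, and any delay values supplied to it are hypothetical and are not taken from the utilities described in the dissertation.

```python
def relative_errors(predicted, reference):
    """Relative error of each predicted delay against its reference delay."""
    return [abs(p - r) / r for p, r in zip(predicted, reference)]


def accuracy_summary(predicted, reference, tight=0.07, loose=0.10):
    """Summarize how closely a delay calculator tracks reference delays."""
    errs = relative_errors(predicted, reference)
    return {
        "worst": max(errs),
        "all_within_loose": all(e <= loose for e in errs),
        "fraction_within_tight": sum(e <= tight for e in errs) / len(errs),
    }
```

Calling `accuracy_summary` with two equal-length lists of delays (in any consistent unit) reports the worst-case error and the share of paths that fall inside the tighter bound.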

The first utility based on the delay traversal routine addresses a certain type of
timing hazard that can result from a two-phase level-sensitive clocking scheme and which can
limit the frequency at which a chip will operate. More rigorous nomenclature can aid a
designer in recognizing these cases, but this is often difficult because the logic can be quite
complex. To ensure completeness, this utility provides an automated approach for
identifying these errors.

Several programs were written to help analyze the clock distribution networks of
the Aurora III chips. The primary one is also based on the delay traversal routines and
generates a ready-to-run HSPICE netlist for each clock phase of the design. As with other
delay-calculator based routines, capacitances generated outside of the Cascade environment
(via Dracula or Mentor Graphics) can be incorporated into the output netlist. After HSPICE
is run for both clock phases, several analysis scripts are used to provide different ways of
analyzing the data, including sorted 2D plots of skew, 3D plots of clock transit times versus
chip location, and a textual listing of latch-to-latch skew.
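The latch-to-latch skew listing can be pictured with a short sketch. It assumes that a clock arrival time per latch has already been extracted from the simulation output; the data layout and function names are hypothetical, and the parsing of HSPICE results is not shown.

```python
def latch_skews(arrival, pairs):
    """Return (source, destination, skew) rows sorted by skew magnitude.

    arrival: latch name -> clock arrival time (any consistent unit)
    pairs:   (source latch, destination latch) tuples connected by logic
    """
    rows = [(src, dst, arrival[dst] - arrival[src]) for src, dst in pairs]
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


def max_skew(arrival):
    """Largest difference between any two clock arrival times."""
    times = list(arrival.values())
    return max(times) - min(times)
```

Sorting by magnitude places the pairs that most constrain the clock period at the top of the listing.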

To complement the Cascade timing analyzer for the analysis of logic depth, a
levelizer based on the delay traversal routine was written. The use of a two-phase latch-based
clocking scheme allows borrowing time from the clock phase which precedes a critical
path. The levelizer generates 3D histograms for which the x-axis is the level of the previous
worst-case path, the y-axis is the level of the current maximum path, and the z-axis
represents the number of instances which have this previous-current pair. An additional option
will print the worst-case path for the phase which succeeds the current one; this is beneficial
when moving logic across latches while performing manual retiming. The levelizer also
generates a list of the 50 instances that appear the most frequently along all paths, allowing
the user to focus on the most troublesome logic blocks.
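The levelization idea can be sketched as follows, assuming the combinational logic between latches is available as an acyclic fan-in map. The data structures and names are hypothetical, and the recursive traversal is only suitable for small examples; it is not the delay traversal routine the levelizer is built on.

```python
from collections import Counter


def levelize(fanin):
    """Gate depth of every gate in an acyclic fan-in map.

    fanin: gate name -> list of driving nodes. Nodes without an entry
    (latch outputs, primary inputs) are treated as level 0.
    """
    cache = {}

    def level(node):
        if node not in fanin or not fanin[node]:
            return 0
        if node not in cache:
            cache[node] = 1 + max(level(driver) for driver in fanin[node])
        return cache[node]

    return {gate: level(gate) for gate in fanin}


def pair_histogram(prev_levels, cur_levels):
    """Count latches by (previous-phase depth, current-phase depth)."""
    return Counter((prev_levels[latch], cur_levels[latch]) for latch in cur_levels)
```

Each key of the resulting counter corresponds to one (x, y) bin of the histogram and its count to the z value.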
Several additional functions were added to an existing post-processing optimization
utility obtained from Cascade. First, automatic buffer selection can be performed by
running a stand-alone version of the delay calculator. The resulting text database of delays,
capacitance, and fanout is then utilized by the optimizer to translate instances with large
delays into buffered versions which have the same functionality. Second, the optimizer
improves gate depth by searching for certain common logic sequences. This pattern-matching
approach recognizes redundant logic that can occur at the interface between two logic
blocks. This utility also merges the drivers for multiple latches into just one instance,
constrained by a user-specified maximum fanout. In the FPU, this optimization removed
approximately 900 latch drivers.
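The driver-merging step can be illustrated with a small sketch. It assumes that latches are already grouped by the signal that drives them and simply packs each group into drivers that respect the fanout limit; the grouping, the names, and the packing order are assumptions, and placement effects that a real optimizer would consider are ignored.

```python
def merge_latch_drivers(latches_by_source, max_fanout):
    """Replace one-driver-per-latch with shared drivers of bounded fanout.

    latches_by_source: driving signal -> list of latch names it clocks
    Returns a list of (signal, latches served by one merged driver).
    """
    drivers = []
    for source, latches in latches_by_source.items():
        for start in range(0, len(latches), max_fanout):
            drivers.append((source, latches[start:start + max_fanout]))
    return drivers
```

With five latches on one signal, two on another, and a maximum fanout of three, seven individual drivers reduce to three merged ones.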

Several smaller utilities were written to check beta ratios, size power rails, and
determine power dissipation. Under the category of future work, several ideas are also
presented concerning an automated approach for performing timing analysis and logic
optimization. The methodology discussed consists of the following steps: 1) levelization in
order to identify paths which exceed the targeted gate depth, 2) optimal logic synthesis for
logic which has fewer than 10 inputs/outputs (the majority of logic within a design is
generated by a relatively small number of inputs and produces a small number of outputs),
3) logic retiming across latch boundaries, and 4) close coupling of the steps of parasitic
extraction, delay calculation, static timing analysis, and buffer sizing/selection. None of
these steps is innovative, but the overall automated system can significantly improve
design time and reduce tedious operations that would otherwise be performed by hand.

The dissertation concludes with a brief discussion of functional verification,
especially as it pertains to random testing. This approach focused on three levels: the
computation units, the chip level, and the system level. Several different simulation
environments were created to support generation and self-checking validation of both random
numerical operations and random instruction sequences. Over 5 million computations were
simulated for each functional unit and over 200 million instructions were simulated in all.
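A self-checking random test of a computation unit can be sketched as below. The sketch assumes the unit under test is wrapped as a Python callable (`dut_add`, a hypothetical name) and uses the host's IEEE 754 double-precision addition in round-to-nearest mode as the reference model; it does not reproduce the simulation environments described here.

```python
import math
import random
import struct


def random_double(rng):
    """Draw a double from a random 64-bit pattern (covers NaN, Inf, denormals)."""
    return struct.unpack("<d", struct.pack("<Q", rng.getrandbits(64)))[0]


def same_result(x, y):
    """Bit-exact comparison, except that any NaN matches any NaN."""
    if math.isnan(x) and math.isnan(y):
        return True
    return struct.pack("<d", x) == struct.pack("<d", y)


def run_random_test(dut_add, vectors, seed=0):
    """Apply random operand pairs and collect those the unit gets wrong."""
    rng = random.Random(seed)
    failures = []
    for _ in range(vectors):
        a, b = random_double(rng), random_double(rng)
        expected = a + b  # host addition serves as the reference model
        if not same_result(dut_add(a, b), expected):
            failures.append((a, b))
    return failures
```

Drawing operands from raw bit patterns rather than from a numeric range exercises special values and extreme exponents far more often than uniformly distributed reals would.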
Aurora  III  Chip  Layout
Appendix  B

Corrections  to  Add  Unit 
Logic 

1  Generation  of  Guard,  Round,  and  Sticky  Bits 

The  nomenclature  used  for  the  following  equations  is  defined  in  [Quach91a], 
[Quach91c].  Primary  inputs  are  defined  as: 

expAbs  exponent  of  larger  operand 

G  intermediate  guard  bit  that  is  generated  from  alignment  shifter 

R  intermediate  round  bit  that  is  generated  from  alignment  shifter 

S  intermediate  sticky  bit  that  is  generated  from  alignment  shifter 

The final guard, round, and sticky bits are given the names, respectively: bn, bnpl, s.
This logic is necessary because the alignment shifter does not zero out the guard and/or
round bits for a shift greater than the width of the mantissa (53 bits). As a result of the
hidden bit, the sticky bit may need to be set for a shift greater than the mantissa width; this
action is not performed by the alignment logic that generates the intermediate sticky bit.
The revised equations, which differ from those presented in the above papers, are:

Es54 = expAbs[5] & expAbs[4] & ~expAbs[3] & expAbs[2] & expAbs[1] & ~expAbs[0]

Esgt55A = expAbs[5] & expAbs[4] & ~expAbs[3] & expAbs[2] & expAbs[1] & expAbs[0]

Esgt55B = NAND{~expAbs[10:6]}

Esgt55C = expAbs[5] & expAbs[4] & expAbs[3]

Esgt55 = Esgt55A + Esgt55B + Esgt55C

Es09 = AND{~expAbs[10:1]}

bn = (~Es09 + Es10) & ~Es54 & ~Esgt55 & G

bnpl = Es54 + (~Es54 & (Es09 + Esgt55 + ~R))

s = (~Es09 & ~Es54 & ~S) + (Es54 & R) + (Es54 & S) + Esgt55
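The shift-amount decodes in these equations can be checked exhaustively with a short script. The sketch below assumes that expAbs carries an 11-bit alignment shift amount, which is an interpretation rather than something stated in this appendix; the Python function names are hypothetical. It confirms that the first decode matches a value of 54 in the low six bits and that the OR of the three A, B, and C terms is true exactly for shifts of 55 or more.

```python
def bit(value, index):
    """Extract one bit of an integer as 0 or 1."""
    return (value >> index) & 1


def es54(d):
    """Low six bits equal 110110 (54), following the first decode equation."""
    return bool(bit(d, 5) & bit(d, 4) & (1 - bit(d, 3))
                & bit(d, 2) & bit(d, 1) & (1 - bit(d, 0)))


def shift_exceeds_54(d):
    """OR of the A, B, and C terms of the greater-than decode."""
    term_a = (bit(d, 5) & bit(d, 4) & (1 - bit(d, 3))
              & bit(d, 2) & bit(d, 1) & bit(d, 0))        # low six bits equal 55
    term_b = int(any(bit(d, i) for i in range(6, 11)))    # NAND of complemented bits 10:6
    term_c = bit(d, 5) & bit(d, 4) & bit(d, 3)            # low six bits in 56..63
    return bool(term_a | term_b | term_c)
```

Because the first decode ignores bits 10:6, it also fires for values such as 118; in those cases the B term is already set, which the bn equation accounts for by gating on both signals.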

2  Additional  Rounding  Logic 

The  nomenclature  used  for  the  following  equations  is  defined  in  [Quach91a], 
[Quach91c].  This  section  also  summarizes  revised  versions  for  some  of  the  logic  presented 
in  these  papers.  Primary  inputs  are  defined  as: 

addORsub  operation  (addition  =  0,  subtraction  =1) 

Eo  effective operation, considering sign of operands (addition = 0, subtraction = 1)

Ccabar  carry-out of mantissa adder (Cin = 0 carry tree)

Es09  AND of the complement of the high bits of larger exponent

Es10bar  lsb of larger exponent

Cexp  carry-out of exponent adder, used to determine which operand is larger

Fs00  msb of A+B+1 sum of mantissa adder

Fs10  msb of A+B+1 sum of mantissa adder

S51  XOR of lsb's of input operands

S52  XOR of lsb+1's of input operands

bn  guard  bit 

bnpl  round  bit 
s  sticky  bit

Sa  sign  of  RS  operand 

Sb  sign  of  RT  operand 

RM  rounding  mode  (RN  =  0,  RZ  =  1,  RP  =  2,  RM  =  3) 

Normalization  of  the  result  falls  into  several  classes:  one  right  shift  (ORS),  no  shift 
(NXS),  many  left  shift  (MLS),  one  left  shift  (OLS).  The  conditions  which  define  these 
classes  are: 

ORS = ~Eo & ~Ccabar

NXS = (~Eo & Ccabar) + ((Eo & ~Es09) & ((Fs10 & ~(bn + bnpl + s)) + (Fs00 &
(bn + bnpl + s))))

MLS = Eo & Es09

OLS = (Eo & ~Es09) & ((~Fs10 & ~(bn + bnpl + s)) + (~Fs10 & (bn + bnpl + s)))
The  result  sign  is  defined  as: 

Se = (Cexp & Sa) + (~Cexp & ~Ccabar & ~Sb & addORsub) + (~Cexp & ~Ccabar &
Sb & ~addORsub) + (~Cexp & Ccabar & Sa)

For  each  of  the  rounding  modes,  two  results  are  produced:  CinRX,  which  indicates 
whether  a  round  is  needed,  and  qRX  which  is  the  least-significant-bit  to  be  shifted  in  during 
a  normalization. 
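For orientation, the textbook rounding decision for the four modes is sketched below. This is not the CinRX logic of this appendix, which folds the decision into the carry-in of the mantissa adder for each normalization class; it is the generic rule applied to an already normalized, truncated magnitude. The function name and argument layout are assumptions, and the mode encoding follows the RM input defined above.

```python
RN, RZ, RP, RM = 0, 1, 2, 3  # rounding-mode encoding of the RM input


def needs_increment(mode, sign, lsb, guard, sticky):
    """Return True when the truncated magnitude must be incremented by one ulp.

    guard:  first discarded bit
    sticky: OR of all remaining discarded bits
    sign:   1 for a negative result, 0 for a positive result
    """
    inexact = guard | sticky
    if mode == RN:  # round to nearest, ties to even
        return bool(guard & (sticky | lsb))
    if mode == RZ:  # truncate toward zero
        return False
    if mode == RP:  # toward plus infinity: only positive results grow
        return bool(inexact and not sign)
    if mode == RM:  # toward minus infinity: only negative results grow
        return bool(inexact and sign)
    raise ValueError("unknown rounding mode")
```

In this simplified view the round and sticky bits of the appendix are collapsed into the single `sticky` argument once the result position is known.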

Round-to-nearest: 

A = Eo & Es09 & Es10bar & (~bn + Fs00) & (S52 + ~bn)

B = (~Eo & S52 & (bn + bnpl + s + S51)) + (Eo & Es09 & ~Es10bar)

C = ~Eo & bn & (bnpl + s + S52)

D = (Eo & ~Es09 & (~bn + (bn & ~bnpl & ~s & S52))) + (Eo & Es09 & ~Es10bar
& bn & S52)

E = Eo & ~Es09 & ~bn & (~bnpl + ~s)

CinRN = A + (~Ccabar & B) + (Ccabar & C) + (Fs00 & D) + (~Fs00 & E)

qRN = (~bn & bnpl & s) + (bn & ~bnpl)

Round-to-Zero: 

CinRZ = (~Es09 & Eo & ~(bn + bnpl + s)) + (Es09 & Eo & ~bn & ~Ccabar)

qRZ = bn ^ (s + bnpl)

Round-to-plus-infinity:

roundMLSenCin = ~bn + (~Es10bar + Fs00) & (~Es10bar + Ccabar)

CinRPpNXS = ((((Eo & (Se + (~bn & ~bnpl & ~s)))) + (~Eo & (~Se & (bn + bnpl +
s)))))

CinRPppNXS = NXS & S52 & (CinRPpNXS ^ Ccabar)

CinRPpOLS = (((~Se & ~bn) + (Se & ~bn & ~bnpl & ~s)))

CinRPppOLS = OLS & CinRPpOLS & S52

CinRPpMLS = MLS & roundMLSenCin & ((~Se & bn & S52 & ~Ccabar) + (Eo &
~bn & S52 & ~Ccabar))

CinRP = (ORS & ~Se & (S52 + bn + bnpl + s)) + CinRPppNXS + CinRPppOLS +
CinRPpMLS

qRP = (~Se & bn) + (Se & ((~bn & s) + (~bn & bnpl) + (bn & ~bnpl & ~s))) + (MLS
& bn)

Round-to-minus-infinity:

CinRMpNXS = ((((Eo & (Se + (~bn & ~bnpl & ~s)))) + (~Eo & (Se & (bn + bnpl +
s)))))

CinRMppNXS = NXS & S52 & (CinRMpNXS ^ Ccabar)

CinRMpOLS = (((Se & ~bn) + (~Se & ~bn & ~bnpl & ~s)))

CinRMppOLS = OLS & CinRMpOLS & S52

CinRMpMLS = MLS & roundMLSenCin & ((Se & Eo & S52 & ~Ccabar) + (Eo &
~bn & S52 & ~Ccabar))

CinRM = (ORS & Se & (S52 + bn + bnpl + s)) + CinRMppNXS + CinRMppOLS +
CinRMpMLS

qRM = (Se & bn) + (~Se & ((~bn & s) + (~bn & bnpl) + (bn & ~bnpl & ~s))) + (MLS
& bn)

Finally,  a  signal  is  needed  to  determine  whether  the  output  of  the  mantissa  adder  should 
be  complemented: 

ComplSORPpNXS = ~S52 & ~(CinRPpNXS ^ Ccabar)

ComplSORMpNXS = ~S52 & ~(CinRMpNXS ^ Ccabar)

roundMLSenComplSO = (bn & Es10bar & ~S52 & Fs00) + (bn & Es10bar & ~S52
& Ccabar)

ComplSOpMLS = MLS & ((~Eo & ~bn & ~RM[0] & ~S52) + (Se & bn & ~RM[0] &
~S52) + (~RM[0] & ~S52 & Ccabar) + (RM[0] & ~S52 & Ccabar)
+ (~Se & bn & RM[0] & ~S52) + (~Se & ~Eo & RM[0] & ~S52) +
roundMLSenComplSO)

ComplSO = (RM[0] & OLS & ~(CinRMpOLS + S52)) + (~RM[0] & OLS &
~(CinRPpOLS + S52)) + (~RM[0] & ComplSORPpNXS & NXS) + (RM[0] &
ComplSORMpNXS & NXS) + ComplSOpMLS


Appendix  C 

References  used  for  plots  of 
clock  frequency  vs.  year 
and  transistor  count  vs. 
year 


[Benschneider89] [Benschneider89] [Binnan90] [Darley90] [Elkind87] [Fossum85]
[Fuccio88] [Gavrielov86] [Gosling81] [Ho85] [Ide92] [Jouppi88] [Jouppi89]
[Jouppi89] [Kaneko89] [Kasai85] [Kohn89] [Komal85] [Komori89] [Kawakami86]
[Kawasaki89] [Lu88] [McAllister86] [Molnar89] [Montoye90] [Nakayama89] [Okamoto91]
[Oehler90] [Papamichalis88] [Rowen88] [Schutz91] [Shimazu89] [Sit89] [Sohie88]
[Staver87] [Steiss91] [Takeda85] [Takla84] [Taylor90] [Tran85] [Troutman86] [Ware82]
[Ware84] [Wolrich84]